This tutorial makes use of the following R package(s): dplyr, tidyr and stringr

The following data are used in some of the subsequent tutorials (including the one on ggplot2) and make use of some advanced data manipulation routines. The input data file formats are provided as is by their source and are modified to facilitate ingestion into some the plotting routines covered in later exercises. Data used in this tutorial include grain harvest for north america and income/education census data for the US.

dat1 <- read.csv("http://mgimond.github.io/ES218/Data/FAO_grains_NA.csv", header=TRUE)
dat2 <- read.csv("http://mgimond.github.io/ES218/Data/Income_education.csv", header=TRUE)

Dataset dat1

dat1 consists of grain yields by north american countries and by year. The dataset was downloaded from http://faostat3.fao.org/ in June of 2014.

A subset of dat1 will be used in later tutorials in both a wide form and a long form. The wide form will be called dat1w and will be a table of year vs. crop yields.

library(dplyr)
library(tidyr)

dat1w <- dat1 %>%
         filter(Information == "Yield (Hg/Ha)", 
                Country     =="United States of America", 
                Crop       %in% c("Oats", "Maize", "Barley", "Buckwheat","Rye")) %>%
         select(Year, Crop, Value)                                          %>%
         spread(key = Crop, value="Value")
head(dat1w)
  Year   Barley Buckwheat    Maize     Oats      Rye
1 1961 16488.52  10886.67 39183.63 15171.26 11121.79
2 1962 18839.00  11737.50 40620.80 16224.60 12892.77
3 1963 18808.27  11995.00 42595.55 16253.04 11524.11
4 1964 20208.88  11566.50 39498.36 15471.55 12026.24
5 1965 23070.58  11875.00 46492.14 18001.04 14192.91
6 1966 20581.66  11956.52 45891.88 16117.92 13670.75

The long form version of the subset will be called dat1l and will be a long form representation of dat1w (yield by crop and year).

dat1l <- gather(dat1w, key = "Crop", value = "Yield", 2:6)
head(dat1l,10)
   Year   Crop    Yield
1  1961 Barley 16488.52
2  1962 Barley 18839.00
3  1963 Barley 18808.27
4  1964 Barley 20208.88
5  1965 Barley 23070.58
6  1966 Barley 20581.66
7  1967 Barley 21785.30
8  1968 Barley 23557.07
9  1969 Barley 24039.46
10 1970 Barley 23048.77

Another subset will be used in subsequent exercises and will consist of total yields for each year by crop and country.

dat1l2 <- dat1 %>%
          filter(Information == "Yield (Hg/Ha)", 
                 Crop       %in% c("Oats", "Maize", "Barley", "Buckwheat","Rye")) %>%
          select( Year, Crop, Country,  Yield = Value)  # Note that we are renaming the Value field

head(dat1l2,15)
   Year      Crop                  Country    Yield
1  2012    Barley                   Canada 38894.66
2  2012     Maize                   Canada 83611.49
3  2012      Oats                   Canada 24954.79
4  2012       Rye                   Canada 38056.86
5  2012    Barley United States of America 36533.24
6  2012 Buckwheat United States of America 10445.86
7  2012     Maize United States of America 77441.67
8  2012      Oats United States of America 21974.70
9  2012       Rye United States of America 17575.73
10 2011    Barley                   Canada 32796.43
11 2011     Maize                   Canada 88946.49
12 2011      Oats                   Canada 29109.36
13 2011       Rye                   Canada 24676.81
14 2011    Barley United States of America 37431.96
15 2011 Buckwheat United States of America 10299.05

Dataset dat2

dat2 consists of county income and educational attainment for both the male and female population. A codebook available here provides descriptions for the different codes. We will remove the cases (rows) from dat2 for which values are missing (i.e. cases having a NA designation) since these rows will serve no purpose (such cases may be associated with counties having no year-round residents or a resident population too small for data dissemination).

dat2 <- na.omit(dat2)

As with dat1, we will create a tidy version of dat2 for use with packages such as ggplot2.

The dat2 dataset has income data broken down by educational attainment and gender aggregated at the county level. It would therefore be convenient for plot operations if two variables, Gender and (educational) Level, were added to the long table version of dat2.

We will first generate a lookup table, Edu.Gend, of variable elements that will match each census category (e.g. B20004001, B20004002, …) to its matching pair of Level and Gender types.

We will also create a State/Region lookup table, st.reg, that will store two variables: the two letter state abbreviation variable State and its matching region variable Region. R has a built-in vector called state.region that assigns a region to each state. However, you’ll note that this vector only has region names but makes no reference to states. It’s intended to be used with another built-in data vector called state.abb or state.name. We will combine state.abb with state.region to create st.reg. We will also need to convert the uppercase state abbreviations to lower case using tolower so that they match the dat2’s lowercase state abbreviations. Note that D.C. is not included in the built-in states dataset, so we will add this record the st.reg table and assign D.C. to the South region.

Finally, the two tables, Edu.gend and st.reg, will be joined to the long version of dat2 such that each observation will be assigned a Level, Gender and Region.

library(stringr)
# Create a variable/Level/Gender join table
Edu.Gend <- data.frame( 
               variable   = paste("B200040", str_pad(1:18, width=2 , pad="0"),sep="" ),
               Level      = rep(c("All", "NoHS","HS","AD","BD","Grad"), times=3),
               Gender     = rep(c("All", "M","F"), each=6) )

# Create a region/state join table
st.reg <- data.frame(State = tolower(state.abb), Region = state.region)
st.reg <- rbind(st.reg , data.frame(State="dc", Region="South") )

# Start the piping operations
dat2b <- dat2 %>%
  gather(key = "variable", value = "value", -1:-2)  %>%
  inner_join(Edu.Gend, by="variable" )              %>%
  select(State, County, Level, Gender, value)       %>%
  mutate(Level = factor(Level, 
                        levels = c("All","NoHS","HS", "AD", "BD", "Grad"))) %>%
  inner_join(st.reg , by="State")
head(dat2b)
  State  County Level Gender value Region
1    al Autauga   All    All 35881  South
2    al Baldwin   All    All 31439  South
3    al Barbour   All    All 25201  South
4    al    Bibb   All    All 29016  South
5    al  Blount   All    All 32035  South
6    al Bullock   All    All 26408  South
tail(dat2b)
      State     County Level Gender value Region
55201    wy   Sublette  Grad      F 59683   West
55202    wy Sweetwater  Grad      F 63681   West
55203    wy      Teton  Grad      F 48357   West
55204    wy      Uinta  Grad      F 52321   West
55205    wy   Washakie  Grad      F 50341   West
55206    wy     Weston  Grad      F 56765   West

Note that we have eliminated references to variable names such as “B20004001” from dat2b making it easier to interpret the variable names/values.

In dat2b the gender values are lumped under the variable called Gender. We may wish to have male and female incomes assigned their own columns so we will need to widen dat2b so that each gender value is given its own column. We will use the spread function from the tidyr package to create a new data frame called dat2.gender.

dat2c <- spread(dat2b, key = Gender, value = value )

We will also take this opportunity to add a new field to dat2c called Region which will associate each state with a region. R has a built-in vector called state.region that assigns a region to each state.

state.region
 [1] South         West          West          South         West          West          Northeast     South        
 [9] South         South         West          West          North Central North Central North Central North Central
[17] South         South         Northeast     South         Northeast     North Central North Central South        
[25] North Central West          North Central West          Northeast     Northeast     West          Northeast    
[33] South         North Central North Central South         West          Northeast     Northeast     South        
[41] North Central South         South         West          Northeast     South         West          South        
[49] North Central West         
Levels: Northeast South North Central West

However, you’ll note that this vector only has region names but makes no reference to states. It’s intended to be used with another built-in data vector called state.abb or state.name. We will combine state.abb with state.region to create a data frame “lookup table”. We will also need to convert the uppercase state abbreviations to lower case using tolower so that they match the dat2b’s lowercase state abbreviations.

st.reg <- data.frame(State = tolower(state.abb), Region = state.region)

Our income dataset has data for all 50 states as well as the District of Columbia (D.C.) as shown in the following output:

unique(dat2c$State)
 [1] "ak" "al" "ar" "az" "ca" "co" "ct" "dc" "de" "fl" "ga" "hi" "ia" "id" "il" "in" "ks" "ky" "la" "ma" "md" "me" "mi"
[24] "mn" "mo" "ms" "mt" "nc" "nd" "ne" "nh" "nj" "nm" "nv" "ny" "oh" "ok" "or" "pa" "ri" "sc" "sd" "tn" "tx" "ut" "va"
[47] "vt" "wa" "wi" "wv" "wy"

But D.C. is not included in the built-in states dataset, so we will add this record the the st.reg table and assign D.C. to the South region.

st.reg <- rbind(st.reg , data.frame(State="dc", Region="South") )

Now we’ll join this look-up table to the dat2c data frame. We’ll use dplyrs inner_join function.

dat2c <- inner_join(dat2c, st.reg, by = "State")
head(dat2c)
  State                 County Level Region.x   All     F     M Region.y
1    ak Aleutians East Borough   All     West 21953 20164 22940     West
2    ak Aleutians East Borough  NoHS     West 21953 19250 22885     West
3    ak Aleutians East Borough    HS     West 20770 19671 21192     West
4    ak Aleutians East Borough    AD     West 26383 26750 26352     West
5    ak Aleutians East Borough    BD     West 22431 19592 27875     West
6    ak Aleutians East Borough  Grad     West 74000 74000 71250     West