dplyr tidyr stringr
0.8.3 1.0.2 1.4.0

The following data are used in some of the subsequent tutorials (including the one on ggplot2) and make use of some advanced data manipulation routines. The input data file formats are provided as is by their source and are modified to facilitate ingestion into some the plotting routines covered in later exercises. Data used in this tutorial include grain harvest for north america and income/education census data for the US.

dat1 <- read.csv("http://mgimond.github.io/ES218/Data/FAO_grains_NA.csv", header=TRUE)
dat2 <- read.csv("http://mgimond.github.io/ES218/Data/Income_education.csv", header=TRUE)

## Dataset dat1

dat1 consists of grain yields by north american countries and by year. The dataset was downloaded from http://faostat3.fao.org/ in June of 2014.

A subset of dat1 will be used in later tutorials in both a wide form and a long form. The wide form will be called dat1w and will be a table of year vs. crop yields.

library(dplyr)
library(tidyr)

dat1w <- dat1 %>%
filter(Information == "Yield (Hg/Ha)",
Country     =="United States of America",
Crop       %in% c("Oats", "Maize", "Barley", "Buckwheat","Rye")) %>%
select(Year, Crop, Value)                                          %>%
head(dat1w)
  Year   Barley Buckwheat    Maize     Oats      Rye
1 1961 16488.52  10886.67 39183.63 15171.26 11121.79
2 1962 18839.00  11737.50 40620.80 16224.60 12892.77
3 1963 18808.27  11995.00 42595.55 16253.04 11524.11
4 1964 20208.88  11566.50 39498.36 15471.55 12026.24
5 1965 23070.58  11875.00 46492.14 18001.04 14192.91
6 1966 20581.66  11956.52 45891.88 16117.92 13670.75

The long form version of the subset will be called dat1l and will be a long form representation of dat1w (yield by crop and year).

dat1l <- gather(dat1w, key = "Crop", value = "Yield", 2:6)
head(dat1l,10)
   Year   Crop    Yield
1  1961 Barley 16488.52
2  1962 Barley 18839.00
3  1963 Barley 18808.27
4  1964 Barley 20208.88
5  1965 Barley 23070.58
6  1966 Barley 20581.66
7  1967 Barley 21785.30
8  1968 Barley 23557.07
9  1969 Barley 24039.46
10 1970 Barley 23048.77

Another subset will be used in subsequent exercises and will consist of total yields for each year by crop and country.

dat1l2 <- dat1 %>%
filter(Information == "Yield (Hg/Ha)",
Crop       %in% c("Oats", "Maize", "Barley", "Buckwheat","Rye")) %>%
select( Year, Crop, Country,  Yield = Value)  # Note that we are renaming the Value field

head(dat1l2,15)
   Year      Crop                  Country    Yield
5  2012    Barley United States of America 36533.24
6  2012 Buckwheat United States of America 10445.86
7  2012     Maize United States of America 77441.67
8  2012      Oats United States of America 21974.70
9  2012       Rye United States of America 17575.73
14 2011    Barley United States of America 37431.96
15 2011 Buckwheat United States of America 10299.05

## Dataset dat2

### A tidy table: dat2b

dat2 consists of county income and educational attainment for both the male and female population. A codebook available here provides descriptions for the different codes. We will remove the cases (rows) from dat2 for which values are missing (i.e. cases having a NA designation) since these rows will serve no purpose (such cases may be associated with counties having no year-round residents or a resident population too small for data dissemination).

dat2 <- na.omit(dat2)

As with dat1, we will create a tidy version of dat2 for use with packages such as ggplot2.

The dat2 dataset has income data broken down by educational attainment and gender aggregated at the county level. It would therefore be convenient for plot operations if two variables, Gender and (educational) Level, were added to the long table version of dat2.

We will first generate a lookup table, Edu.Gend, of variable elements that will match each census category (e.g. B20004001, B20004002, …) to its matching pair of Level and Gender types.

We will also create a State/Region lookup table, st.reg, that will store two variables: the two letter state abbreviation variable State and its matching region variable Region. R has a built-in vector called state.region that assigns a region to each state. However, you’ll note that this vector only has region names but makes no reference to states. It’s intended to be used with another built-in data vector called state.abb or state.name. We will combine state.abb with state.region to create st.reg. We will also need to convert the uppercase state abbreviations to lower case using tolower so that they match the dat2’s lowercase state abbreviations. Note that D.C. is not included in the built-in states dataset, so we will add this record the st.reg table and assign D.C. to the South region.

Finally, the two tables, Edu.gend and st.reg, will be joined to the long version of dat2 such that each observation will be assigned a Level, Gender and Region.

library(stringr)
# Create a variable/Level/Gender join table
Edu.Gend <- data.frame(
Gender     = rep(c("All", "M","F"), each=6) )

# Create a region/state join table
st.reg <- data.frame(State = tolower(state.abb), Region = state.region)
st.reg <- rbind(st.reg , data.frame(State="dc", Region="South") )

# Start the piping operations
dat2b <- dat2 %>%
gather(key = "variable", value = "value", -1:-2)  %>%
left_join(Edu.Gend, by="variable" )              %>%
select(State, County, Level, Gender, value)       %>%
mutate(Level = factor(Level,
left_join(st.reg , by="State")
head(dat2b)
  State  County Level Gender value Region
1    al Autauga   All    All 35881  South
2    al Baldwin   All    All 31439  South
3    al Barbour   All    All 25201  South
4    al    Bibb   All    All 29016  South
5    al  Blount   All    All 32035  South
6    al Bullock   All    All 26408  South
tail(dat2b)
      State     County Level Gender value Region
55201    wy   Sublette  Grad      F 59683   West
55202    wy Sweetwater  Grad      F 63681   West
55203    wy      Teton  Grad      F 48357   West
55204    wy      Uinta  Grad      F 52321   West
55205    wy   Washakie  Grad      F 50341   West
55206    wy     Weston  Grad      F 56765   West

Note that we have eliminated references to variable names such as “B20004001” from dat2b making it easier to interpret the variable names/values. Also note that we have re-leveled the educational attainment factor Level to reflect the implied order in educational attainment levels.

### Spreading gender across columns: dat2c

In dat2b the gender values are lumped under the variable called Gender. In some upcoming lectures, we will want to compare male and female incomes requiring that they be assigned their own columns. We will therefore widen dat2b. We will use the spread function from the tidyr package to create a new data frame called dat2c.

dat2c <- spread(dat2b, key = Gender, value = value )

head(dat2c)
  State                 County Level Region   All     F     M
1    ak Aleutians East Borough   All   West 21953 20164 22940
2    ak Aleutians East Borough  NoHS   West 21953 19250 22885
3    ak Aleutians East Borough    HS   West 20770 19671 21192
4    ak Aleutians East Borough    AD   West 26383 26750 26352
5    ak Aleutians East Borough    BD   West 22431 19592 27875
6    ak Aleutians East Borough  Grad   West 74000 74000 71250