This tutorial makes use of the following R package(s): dplyr
, tidyr
and stringr
The following data are used in some of the subsequent tutorials (including the one on ggplot2
) and make use of some advanced data manipulation routines. The input data file formats are provided as is by their source and are modified to facilitate ingestion into some the plotting routines covered in later exercises. Data used in this tutorial include grain harvest for north america and income/education census data for the US.
dat1 <- read.csv("http://mgimond.github.io/ES218/Data/FAO_grains_NA.csv", header=TRUE)
dat2 <- read.csv("http://mgimond.github.io/ES218/Data/Income_education.csv", header=TRUE)
dat1
dat1
consists of grain yields by north american countries and by year. The dataset was downloaded from http://faostat3.fao.org/ in June of 2014.
A subset of dat1
will be used in later tutorials in both a wide form and a long form. The wide form will be called dat1w
and will be a table of year vs. crop yields.
library(dplyr)
library(tidyr)
dat1w <- dat1 %>%
filter(Information == "Yield (Hg/Ha)",
Country =="United States of America",
Crop %in% c("Oats", "Maize", "Barley", "Buckwheat","Rye")) %>%
select(Year, Crop, Value) %>%
spread(key = Crop, value="Value")
head(dat1w)
Year Barley Buckwheat Maize Oats Rye
1 1961 16488.52 10886.67 39183.63 15171.26 11121.79
2 1962 18839.00 11737.50 40620.80 16224.60 12892.77
3 1963 18808.27 11995.00 42595.55 16253.04 11524.11
4 1964 20208.88 11566.50 39498.36 15471.55 12026.24
5 1965 23070.58 11875.00 46492.14 18001.04 14192.91
6 1966 20581.66 11956.52 45891.88 16117.92 13670.75
The long form version of the subset will be called dat1l
and will be a long form representation of dat1w
(yield by crop and year).
dat1l <- gather(dat1w, key = "Crop", value = "Yield", 2:6)
head(dat1l,10)
Year Crop Yield
1 1961 Barley 16488.52
2 1962 Barley 18839.00
3 1963 Barley 18808.27
4 1964 Barley 20208.88
5 1965 Barley 23070.58
6 1966 Barley 20581.66
7 1967 Barley 21785.30
8 1968 Barley 23557.07
9 1969 Barley 24039.46
10 1970 Barley 23048.77
Another subset will be used in subsequent exercises and will consist of total yields for each year by crop and country.
dat1l2 <- dat1 %>%
filter(Information == "Yield (Hg/Ha)",
Crop %in% c("Oats", "Maize", "Barley", "Buckwheat","Rye")) %>%
select( Year, Crop, Country, Yield = Value) # Note that we are renaming the Value field
head(dat1l2,15)
Year Crop Country Yield
1 2012 Barley Canada 38894.66
2 2012 Maize Canada 83611.49
3 2012 Oats Canada 24954.79
4 2012 Rye Canada 38056.86
5 2012 Barley United States of America 36533.24
6 2012 Buckwheat United States of America 10445.86
7 2012 Maize United States of America 77441.67
8 2012 Oats United States of America 21974.70
9 2012 Rye United States of America 17575.73
10 2011 Barley Canada 32796.43
11 2011 Maize Canada 88946.49
12 2011 Oats Canada 29109.36
13 2011 Rye Canada 24676.81
14 2011 Barley United States of America 37431.96
15 2011 Buckwheat United States of America 10299.05
dat2
dat2b
dat2
consists of county income and educational attainment for both the male and female population. A codebook available here provides descriptions for the different codes. We will remove the cases (rows) from dat2
for which values are missing (i.e. cases having a NA
designation) since these rows will serve no purpose (such cases may be associated with counties having no year-round residents or a resident population too small for data dissemination).
dat2 <- na.omit(dat2)
As with dat1
, we will create a tidy version of dat2
for use with packages such as ggplot2
.
The dat2
dataset has income data broken down by educational attainment and gender aggregated at the county level. It would therefore be convenient for plot operations if two variables, Gender
and (educational) Level
, were added to the long table version of dat2
.
We will first generate a lookup table, Edu.Gend
, of variable elements that will match each census category (e.g. B20004001
, B20004002
, …) to its matching pair of Level
and Gender
types.
We will also create a State/Region lookup table, st.reg
, that will store two variables: the two letter state abbreviation variable State
and its matching region variable Region
. R has a built-in vector called state.region
that assigns a region to each state. However, you’ll note that this vector only has region names but makes no reference to states. It’s intended to be used with another built-in data vector called state.abb
or state.name
. We will combine state.abb
with state.region
to create st.reg
. We will also need to convert the uppercase state abbreviations to lower case using tolower
so that they match the dat2
’s lowercase state abbreviations. Note that D.C. is not included in the built-in states dataset, so we will add this record the st.reg
table and assign D.C. to the South
region.
Finally, the two tables, Edu.gend
and st.reg
, will be joined to the long version of dat2
such that each observation will be assigned a Level
, Gender
and Region
.
library(stringr)
# Create a variable/Level/Gender join table
Edu.Gend <- data.frame(
variable = paste("B200040", str_pad(1:18, width=2 , pad="0"),sep="" ),
Level = rep(c("All", "NoHS","HS","AD","BD","Grad"), times=3),
Gender = rep(c("All", "M","F"), each=6) )
# Create a region/state join table
st.reg <- data.frame(State = tolower(state.abb), Region = state.region)
st.reg <- rbind(st.reg , data.frame(State="dc", Region="South") )
# Start the piping operations
dat2b <- dat2 %>%
gather(key = "variable", value = "value", -1:-2) %>%
left_join(Edu.Gend, by="variable" ) %>%
select(State, County, Level, Gender, value) %>%
mutate(Level = factor(Level,
levels = c("All","NoHS","HS", "AD", "BD", "Grad"))) %>%
left_join(st.reg , by="State")
head(dat2b)
State County Level Gender value Region
1 al Autauga All All 35881 South
2 al Baldwin All All 31439 South
3 al Barbour All All 25201 South
4 al Bibb All All 29016 South
5 al Blount All All 32035 South
6 al Bullock All All 26408 South
tail(dat2b)
State County Level Gender value Region
55201 wy Sublette Grad F 59683 West
55202 wy Sweetwater Grad F 63681 West
55203 wy Teton Grad F 48357 West
55204 wy Uinta Grad F 52321 West
55205 wy Washakie Grad F 50341 West
55206 wy Weston Grad F 56765 West
Note that we have eliminated references to variable names such as “B20004001” from dat2b
making it easier to interpret the variable names/values. Also note that we have re-leveled the educational attainment factor Level
to reflect the implied order in educational attainment levels.
dat2c
In dat2b
the gender values are lumped under the variable called Gender
. In some upcoming lectures, we will want to compare male and female incomes requiring that they be assigned their own columns. We will therefore widen dat2b
. We will use the spread
function from the tidyr
package to create a new data frame called dat2c
.
dat2c <- spread(dat2b, key = Gender, value = value )
head(dat2c)
State County Level Region All F M
1 ak Aleutians East Borough All West 21953 20164 22940
2 ak Aleutians East Borough NoHS West 21953 19250 22885
3 ak Aleutians East Borough HS West 20770 19671 21192
4 ak Aleutians East Borough AD West 26383 26750 26352
5 ak Aleutians East Borough BD West 22431 19592 27875
6 ak Aleutians East Borough Grad West 74000 74000 71250