This tutorial makes use of the following R package(s): dplyr and lubridate.

The data file FAO_grains_NA.csv will be used in this exercise. This dataset consists of grain yield and harvest year by North American country. The dataset was downloaded from http://faostat3.fao.org/ in June of 2014.

Run the following line to load the FAO data file into your current R session.

dat <- read.csv("http://mgimond.github.io/ES218/Data/FAO_grains_NA.csv", header=TRUE)

The dplyr basics

The basic set of R tools can accomplish many data table queries, but the syntax can be overwhelming and verbose. The package dplyr offers some nifty and simple querying functions as shown in the next subsections. dplyr’s data manipulation functions are summarized in the following table:

dplyr function Description
filter() Subset by row values
arrange() Sort rows by column values
select() Subset columns
mutate() Add columns
summarise() Summarize columns

Make sure to load the dplyr package before proceeding with the following examples.

library(dplyr)

Subset by rows: filter

Tables can be subsetted by rows based on column values. For example, we may wish to grab all rows associated with Oats:

dat.query1 <- filter(dat, Crop == "Oats")
summary(dat.query1)
                     Country              Crop                  Information       Year          Value        
 Canada                  :104   Oats        :208   Area harvested (Ha):104   Min.   :1961   Min.   :  12667  
 United States of America:104   Barley      :  0   Yield (Hg/Ha)      :104   1st Qu.:1974   1st Qu.:  20392  
                                Buckwheat   :  0                             Median :1986   Median : 204821  
                                Canary seed :  0                             Mean   :1986   Mean   :1350024  
                                Grain, mixed:  0                             3rd Qu.:1999   3rd Qu.:1638250  
                                Maize       :  0                             Max.   :2012   Max.   :9666550  
                                (Other)     :  0                                                             
                                      Source   
 Calculated data                         :104  
 FAO data based on imputation methodology:  0  
 FAO estimate                            :  0  
 Official data                           :104  
                                               
                                               
                                               

Note that R is case sensitive, so make sure that you respect each letter’s case (i.e. upper or lower).

We can expand our query by including both Oats, Buckwheat and limiting the country to Canada.

dat.query2 <- filter(dat, Crop == "Oats" | Crop == "Buckwheat", 
                          Country == "Canada")
summary(dat.query2)
                     Country              Crop                  Information       Year          Value        
 Canada                  :200   Oats        :104   Area harvested (Ha):101   Min.   :1961   Min.   :      0  
 United States of America:  0   Buckwheat   : 96   Yield (Hg/Ha)      : 99   1st Qu.:1973   1st Qu.:  11822  
                                Barley      :  0                             Median :1986   Median :  22197  
                                Canary seed :  0                             Mean   :1986   Mean   : 504098  
                                Grain, mixed:  0                             3rd Qu.:1998   3rd Qu.: 924675  
                                Maize       :  0                             Max.   :2012   Max.   :4277000  
                                (Other)     :  0                                                             
                                      Source  
 Calculated data                         :99  
 FAO data based on imputation methodology: 0  
 FAO estimate                            : 2  
 Official data                           :99  
                                              
                                              
                                              

The character | is the Boolean operator OR. So in our example, the query can be read as “… crop equals oats OR crop equals buckwheat”. Had we used the AND operator, &, instead as in Crop == "Oats" & Crop == "Buckwheat" the output would have returned zero rows since a Crop value cannot be both Oats AND Buckwheat.

We can expand this query by limiting our output to the years 2005 to 2010

library(dplyr)
dat.query3 <- filter(dat, Crop == "Oats" | Crop == "Buckwheat", 
                          Country == "Canada", 
                          Year >= 2005 & Year <= 2010)
summary(dat.query3)
                     Country             Crop                 Information      Year          Value        
 Canada                  :18   Oats        :12   Area harvested (Ha):9    Min.   :2005   Min.   :   2000  
 United States of America: 0   Buckwheat   : 6   Yield (Hg/Ha)      :9    1st Qu.:2006   1st Qu.:  11500  
                               Barley      : 0                            Median :2007   Median :  26615  
                               Canary seed : 0                            Mean   :2007   Mean   : 453810  
                               Grain, mixed: 0                            3rd Qu.:2008   3rd Qu.: 961825  
                               Maize       : 0                            Max.   :2010   Max.   :1815700  
                               (Other)     : 0                                                            
                                      Source 
 Calculated data                         :9  
 FAO data based on imputation methodology:0  
 FAO estimate                            :0  
 Official data                           :9  
                                             
                                             
                                             

Note the use of the AND Boolean operator (&) instead of the OR operator (|) for the Year query. We want the Year value to satisfy two criteria simultaneously: greater than or equal to 2005 AND less than or equal to 2010. Had we used the | operator, R would have returned all years since all year values satisfy at least one of the two criteria.

Sort rows by column value: arrange

You can sort a table based on a column’s values. For example, to sort dat by crop name type:

dat.sort1 <- arrange(dat, Crop)
head(dat.sort1)
                   Country   Crop         Information Year      Value          Source
1                   Canada Barley Area harvested (Ha) 2012 2060000.00   Official data
2                   Canada Barley       Yield (Hg/Ha) 2012   38894.66 Calculated data
3 United States of America Barley Area harvested (Ha) 2012 1312810.00   Official data
4 United States of America Barley       Yield (Hg/Ha) 2012   36533.24 Calculated data
5                   Canada Barley Area harvested (Ha) 2011 2364800.00   Official data
6                   Canada Barley       Yield (Hg/Ha) 2011   32796.43 Calculated data
tail(dat.sort1)
     Country      Crop         Information Year    Value                                   Source
1496  Canada Triticale Area harvested (Ha) 1991  1093.00                            Official data
1497  Canada Triticale       Yield (Hg/Ha) 1991 21957.91                          Calculated data
1498  Canada Triticale Area harvested (Ha) 1990  1074.00 FAO data based on imputation methodology
1499  Canada Triticale       Yield (Hg/Ha) 1990 26396.65                          Calculated data
1500  Canada Triticale Area harvested (Ha) 1989  1093.00                             FAO estimate
1501  Canada Triticale       Yield (Hg/Ha) 1989 21957.91                          Calculated data

By default, arrange sorts by ascending order. To sort by descending order, wrap the column name with the function desc(). For example, to sort the table by Crop in ascending order then by Year in descending order, type:

dat.sort2 <- arrange(dat, Crop, desc(Year))
head(dat.sort2)
                   Country   Crop         Information Year      Value          Source
1                   Canada Barley Area harvested (Ha) 2012 2060000.00   Official data
2                   Canada Barley       Yield (Hg/Ha) 2012   38894.66 Calculated data
3 United States of America Barley Area harvested (Ha) 2012 1312810.00   Official data
4 United States of America Barley       Yield (Hg/Ha) 2012   36533.24 Calculated data
5                   Canada Barley Area harvested (Ha) 2011 2364800.00   Official data
6                   Canada Barley       Yield (Hg/Ha) 2011   32796.43 Calculated data
tail(dat.sort2)
     Country      Crop         Information Year    Value                                   Source
1496  Canada Triticale Area harvested (Ha) 1991  1093.00                            Official data
1497  Canada Triticale       Yield (Hg/Ha) 1991 21957.91                          Calculated data
1498  Canada Triticale Area harvested (Ha) 1990  1074.00 FAO data based on imputation methodology
1499  Canada Triticale       Yield (Hg/Ha) 1990 26396.65                          Calculated data
1500  Canada Triticale Area harvested (Ha) 1989  1093.00                             FAO estimate
1501  Canada Triticale       Yield (Hg/Ha) 1989 21957.91                          Calculated data

Subset by column: select

You can grab a subset of the table by column(s). To extract the columns Crop, Year and Value, type:

dat.subcol <- select(dat, Crop, Year, Value)
head(dat.subcol)
          Crop Year      Value
1       Barley 2012 2060000.00
2       Barley 2012   38894.66
3    Buckwheat 2012       0.00
4  Canary seed 2012  101900.00
5  Canary seed 2012   12161.92
6 Grain, mixed 2012   57900.00

Adding columns: mutate

You can add columns (and compute their values) using the mutate function. For example, to add a column Ctr_abbr and assign it the abbreviated values CAN for Canada and USA for the United States of America based on the values in column Country type:

dat.extended <- mutate(dat, Ctr_abbr = ifelse(Country == "Canada", "CAN", "USA"))
head(dat.extended)
  Country         Crop         Information Year      Value          Source Ctr_abbr
1  Canada       Barley Area harvested (Ha) 2012 2060000.00   Official data      CAN
2  Canada       Barley       Yield (Hg/Ha) 2012   38894.66 Calculated data      CAN
3  Canada    Buckwheat Area harvested (Ha) 2012       0.00    FAO estimate      CAN
4  Canada  Canary seed Area harvested (Ha) 2012  101900.00   Official data      CAN
5  Canada  Canary seed       Yield (Hg/Ha) 2012   12161.92 Calculated data      CAN
6  Canada Grain, mixed Area harvested (Ha) 2012   57900.00   Official data      CAN
tail(dat.extended)
                      Country    Crop         Information Year      Value          Source Ctr_abbr
1496 United States of America Popcorn Area harvested (Ha) 1961   83162.00   Official data      USA
1497 United States of America Popcorn       Yield (Hg/Ha) 1961   21762.34 Calculated data      USA
1498 United States of America     Rye Area harvested (Ha) 1961  624000.00   Official data      USA
1499 United States of America     Rye       Yield (Hg/Ha) 1961   11121.79 Calculated data      USA
1500 United States of America Sorghum Area harvested (Ha) 1961 4445000.00   Official data      USA
1501 United States of America Sorghum       Yield (Hg/Ha) 1961   27442.07 Calculated data      USA

Here, we make use of an embedded function, ifelse, which performs a conditional operation: if the Country value is Canada return CAN if not, return USA.

Note that if you wish to rename a column, you can use the rename() function instead of mutate.

Summarize columns: summarise

You can summarize (or “collapse”) one or more columns using the summarise function. For instance, to get the minimum and maximum years from the Year column, type:

summarise(dat, yr_min = min(Year), yr_max=max(Year))
  yr_min yr_max
1   1961   2012

Splitting data operations by group: group_by

The group_by function will split any operations applied to the dataframe into groups defined by one or more columns. For example, if we wanted to get the minimum and maximum years from the Year column for which crop data are available by crop type, we would type the following:

dat.grp <- group_by(dat, Crop)
summarise(dat.grp, yr_min = min(Year), yr_max=max(Year))
# A tibble: 11 × 3
           Crop yr_min yr_max
         <fctr>  <int>  <int>
1        Barley   1961   2012
2     Buckwheat   1961   2012
3   Canary seed   1980   2012
4  Grain, mixed   1961   2012
5         Maize   1961   2012
6        Millet   1961   2012
7          Oats   1961   2012
8       Popcorn   1961   1982
9           Rye   1961   2012
10      Sorghum   1961   2012
11    Triticale   1989   2012

The dat.grp object is a special data structure which differs from that of dat in that it explicitly defines the groups for which any subsequent operations (like summarise) are to be split across.

dplyr’s output data structure

Many of dplyr’s functions generate a tibble data table. For example, if we stored the summary output from the last chunk of code to an object called dat.sum, we’d discover that this object is associated with a tb_df (a tibble).

dat.sum <- summarise(dat.grp, yr_min = min(Year), yr_max=max(Year))
class(dat.sum)
[1] "tbl_df"     "tbl"        "data.frame"

A tibble table will behave a little differently than a data frame table when printing to a screen or subsetting its elements. In most cases, a tibble rendering of the table will not pose a problem in a workflow, however, this format may prove problematic with some functions. To remedy this, you can force the dat.sum object to a standalone dataframe as follows:

dat.df <- as.data.frame(dat.sum)
class(dat.df)
[1] "data.frame"

Combining table manipulation functions using %>%

In most cases, you will find yourself wanting to combine several of dplyr’s data manipulation functions. For example, to summarize the total yield of oats by country, you could type the following:

dat.yield <- filter(dat, Information == "Yield (Hg/Ha)", Crop == "Oats")
dat.grp   <- group_by(dat.yield, Country)
dat.sum   <- summarise(dat.grp, mean_yield = mean(Value))
data.frame(dat.sum)
                   Country mean_yield
1                   Canada   21932.51
2 United States of America   19785.44

The one downside to this approach of combining data table manipulation functions for complex queries is the use of several intermediate objects (e.g. dat.yield, dat.grp, dat.sum). This can make the workflow difficult to follow and clutter your R session with needless intermediate objects.

Another approach to combining dplyr operations is to use the piping operator ,%>%, which daisy chains dplyr operations. So our previous workflow would look like:

dat %>%
  filter(Information == "Yield (Hg/Ha)", Crop == "Oats") %>%
  group_by(Country) %>%
  summarise(mean_yield = mean(Value)) %>%
  data.frame()
                   Country mean_yield
1                   Canada   21932.51
2 United States of America   19785.44

The chunk of code can be read as “… with the dat table, filter by …, then group_by …., then summarise by …” with the result from one operation being passed on to the next using the %>% operator. Note that the filter, group_by and summarise functions do not include the data table name making the chunk of code less cluttered and easier to read.

This chunk of code will output the results to the screen, if you want to save it to an object called dat2, replace the first line with dat2 <- dat %>% ....

Here’s a more complicated output whereby the median yield for each USA crop is summarized by year for the years spanning 2005 through 2010 and the output is sorted by crop name (in ascending order) and median yield value (in descending order).

dat %>%
  filter(Information == "Yield (Hg/Ha)", Year >= 2005, Year <=2010, 
         Country=="United States of America") %>%
  group_by(Crop, Year) %>%
  summarise(median_yield = round(median(Value))) %>%
  arrange(Crop, desc(median_yield))
Source: local data frame [42 x 3]
Groups: Crop [7]

        Crop  Year median_yield
      <fctr> <int>        <dbl>
1     Barley  2010        39345
2     Barley  2009        39287
3     Barley  2005        34873
4     Barley  2008        34195
5     Barley  2006        32846
6     Barley  2007        32278
7  Buckwheat  2005        11019
8  Buckwheat  2009        10931
9  Buckwheat  2010        10555
10 Buckwheat  2006        10000
# ... with 32 more rows

Conditional statements

The base ifelse

Conditional statements are used when you want to create an output value that is conditioned on an evaluation. For example, if you want to output a value of 1 if an input value is less than 23 and a value of 0 otherwise, you could make use of the ifelse function:

x <- c(12,102, 43, 20, 90, 0, 12, 6)
ifelse(x < 23, 1, 0)
[1] 1 0 0 1 0 1 1 1

The base ifelse function works as expected when the input/output values are numeric or character, but does not work as expected when applied to factors or dates. For example, if you wish to replace one factor level with another,

x <- as.factor( c("apple", "banana", "banana", "pear", "apple"))
ifelse(x == "pear", "apple", x)
[1] "1"     "2"     "2"     "apple" "1"    

The output is a character where the input factor x is converted to a character representation of its level number (recall that factors encode level values as numbers behind the scene, i.e. apple =1, banana=2, etc…). Likewise, if you wish to replace an erroneous date with a missing value you will get:

library(lubridate)
y = mdy("1/23/2016", "3/2/2016", "12/1/1901", "11/23/2016")
ifelse( year(y) != 2016, NA, y)
[1] 16823 16862    NA 17128

Here, ifelse converts the date object to its internal numeric representation as number of days since 1970.

If you want to ensure that the data type is preserved, a safer alternative is to use dplyr’s if_else function.

dplyr’s if_else

The if_else function (note the underscore _), will preserve data type but does so strictly. For example, the following code will return an error:

if_else( year(y) != 2016, NA, y)
Error: `false` must be type logical, not double

The output data types (NA and y) are not treated the same (by seeing NA, the function assumes that the output must be logical which y is not). The solution is to force NA as a missing date object by wrapping it with a date function as in mdy(NA):

if_else( year(y) != 2016, mdy(NA), y)
[1] "2016-01-23" "2016-03-02" NA           "2016-11-23"

Recoding factors using recode

When working with factors, however, if_else (as of dplyr version 0.7) will produce the following error:

if_else(x == "pear", "apple", x)
Error: `false` must be type character, not integer

R sees apple as a character and x as a number (i.e. the numeric representation of the factor level). A better option for recoding factors is to use dplyr’s recode function:

recode(x , "pear" = "apple")
[1] apple  banana banana apple  apple 
Levels: apple banana

You can recode more than one factor level. In fact, you can even introduce new levels in the recoding scheme:

recode(x , "pear" = "apple", "banana" = "pineapple" )
[1] apple     pineapple pineapple apple     apple    
Levels: apple pineapple

Changing values based on multiple conditions: case_when

ifelse and if_else work great when a single set of conditions is to be satisfied. But if multiple sets of conditions are to be tested, nested if/else statements become cumbersome and are prone to clerical error. The following code highlights an example of nested if/else statements.

z <- c("deg","F","C", "Temperature")
if_else( z == "deg", "Degree", if_else(z == "F", "Fahrenheit", "Celsius"))
[1] "Degree"     "Fahrenheit" "Celsius"    "Celsius"   

A better alternative is to use case_when.

case_when(z == "deg" ~ "Degree",
          z == "F" ~ "Fahrenheit",
          z == "C" ~ "Celsius",
          TRUE ~ z)
[1] "Degree"      "Fahrenheit"  "Celsius"     "Temperature"

The last parameter, TRUE ~ z instructs R to keep all other values as is (not doing so would return NA values).

Note that ifelse, if_else, recode and case_when can all be used inside of a mutate function. For example, to replace Canada and United States of America in variable Country with CAN and USA respectively and to create a new variable called Type which will take on the values of 1, 2 or 3 depending on the values in variable Source, type the following:

dat1 <- dat %>% mutate(Country = recode(Country, "Canada" = "CAN",
                                                 "United States of America" = "USA"),
                       Type = case_when(Source == "Calculated data" ~ 1,
                                        Source == "Official data" ~ 2,
                                        TRUE ~ 3)) 
head(dat1)   
  Country         Crop         Information Year      Value          Source Type
1     CAN       Barley Area harvested (Ha) 2012 2060000.00   Official data    2
2     CAN       Barley       Yield (Hg/Ha) 2012   38894.66 Calculated data    1
3     CAN    Buckwheat Area harvested (Ha) 2012       0.00    FAO estimate    3
4     CAN  Canary seed Area harvested (Ha) 2012  101900.00   Official data    2
5     CAN  Canary seed       Yield (Hg/Ha) 2012   12161.92 Calculated data    1
6     CAN Grain, mixed Area harvested (Ha) 2012   57900.00   Official data    2

More on the group_by function

Count the number of records in each group

In this example, we are identifying the number of records by Crop type. There are two ways this can be accomplished:

dat %>%
  filter(Information == "Yield (Hg/Ha)", Year >= 2005, Year <=2010, 
         Country=="United States of America") %>%
  group_by(Crop) %>%
  count()

Or,

dat %>%
  filter(Information == "Yield (Hg/Ha)", Year >= 2005, Year <=2010, 
         Country=="United States of America") %>%
  group_by(Crop) %>%
  summarise(Count = n())
# A tibble: 7 x 2
       Crop Count
     <fctr> <int>
1    Barley     6
2 Buckwheat     6
3     Maize     6
4    Millet     6
5      Oats     6
6       Rye     6
7   Sorghum     6

Summarize by crop yield sum and year range

dat %>%
  filter(Information == "Yield (Hg/Ha)", Year >= 2005, Year <=2010, 
         Country=="United States of America") %>%
  group_by(Crop) %>%
  summarise( Yield = sum(Value), `Number of Years` = max(Year) - min(Year)) 
# A tibble: 7 x 3
       Crop     Yield `Number of Years`
     <fctr>     <dbl>             <dbl>
1    Barley 212823.30                 5
2 Buckwheat  62505.56                 5
3     Maize 576907.21                 5
4    Millet  99289.42                 5
5      Oats 135712.89                 5
6       Rye 102789.63                 5
7   Sorghum 253548.41                 5

Normalizing each value in a group by the group median

In this example, we are subtracting each value in a group by that group’s median. This can be useful in identifying which year yields are higher than or lower than the median yield value within each crop group. We will concern ourselves with US yields only and sort the output by crop type. We’ll save the output dataframe as dat2.

dat2 <- dat %>% 
  filter(Information == "Yield (Hg/Ha)",
         Country == "United States of America") %>%
  select(Crop, Year, Value)                     %>%
  group_by(Crop)                                %>%
  mutate(NormYield = Value - median(Value))     %>%
  arrange(Crop)

Let’s plot the normalized yields by year for Barley and add a 0 line representing the (normalized) central value.

plot( NormYield ~ Year, dat2[dat2$Crop == "Barley",] )
abline(h = 0, col="red")

The relative distribution of points does not change, but the values do (they are re-scaled) allowing us to compare values based on some localized (group) context. This technique will prove very useful later on in the course when EDA topics are explored.