This tutorial makes use of the following R package(s): dplyr, lubridate and forcats.

The data file FAO_grains_NA.csv will be used in this exercise. This dataset consists of grain yield and harvest year by North American country. The dataset was downloaded from http://faostat3.fao.org/ in June of 2014.

Run the following line to load the FAO data file into your current R session.

dat <- read.csv("http://mgimond.github.io/ES218/Data/FAO_grains_NA.csv", header=TRUE)

The dplyr basics

The basic set of R tools can accomplish many data table queries, but the syntax can be overwhelming and verbose. The package dplyr offers some nifty and simple querying functions as shown in the next subsections. Some of dplyr’s key data manipulation functions are summarized in the following table:

dplyr function Description
filter() Subset by row values
arrange() Sort rows by column values
select() Subset columns
mutate() Add columns
summarise() Summarize columns

Note that all of these functions take as first argument the data table name except when used in a piping operation (pipes will be discussed later in this section). For example:

# When used alone, dataframe dat is inside the select function
dat2 <- select(dat, column1)

# When used in a pipe, dataframe dat is outside of the select function
dat2 <- dat %>% select(column1)

Make sure to load the dplyr package before proceeding with the following examples.

library(dplyr)

Subset by rows: filter

Tables can be subsetted by rows based on column values. For example, we may wish to grab all rows associated with Oats:

dat.query1 <- filter(dat, Crop == "Oats")
summary(dat.query1)
                     Country              Crop                  Information       Year          Value        
 Canada                  :104   Oats        :208   Area harvested (Ha):104   Min.   :1961   Min.   :  12667  
 United States of America:104   Barley      :  0   Yield (Hg/Ha)      :104   1st Qu.:1974   1st Qu.:  20392  
                                Buckwheat   :  0                             Median :1986   Median : 204821  
                                Canary seed :  0                             Mean   :1986   Mean   :1350024  
                                Grain, mixed:  0                             3rd Qu.:1999   3rd Qu.:1638250  
                                Maize       :  0                             Max.   :2012   Max.   :9666550  
                                (Other)     :  0                                                             
                                      Source   
 Calculated data                         :104  
 FAO data based on imputation methodology:  0  
 FAO estimate                            :  0  
 Official data                           :104  
                                               
                                               
                                               

Note that R is case sensitive, so make sure that you respect each letter’s case (i.e. upper or lower).

We can expand our query by including both Oats, Buckwheat and limiting the country to Canada.

dat.query2 <- filter(dat, Crop == "Oats" | Crop == "Buckwheat", 
                          Country == "Canada")
summary(dat.query2)
                     Country              Crop                  Information       Year          Value        
 Canada                  :200   Oats        :104   Area harvested (Ha):101   Min.   :1961   Min.   :      0  
 United States of America:  0   Buckwheat   : 96   Yield (Hg/Ha)      : 99   1st Qu.:1973   1st Qu.:  11822  
                                Barley      :  0                             Median :1986   Median :  22197  
                                Canary seed :  0                             Mean   :1986   Mean   : 504098  
                                Grain, mixed:  0                             3rd Qu.:1998   3rd Qu.: 924675  
                                Maize       :  0                             Max.   :2012   Max.   :4277000  
                                (Other)     :  0                                                             
                                      Source  
 Calculated data                         :99  
 FAO data based on imputation methodology: 0  
 FAO estimate                            : 2  
 Official data                           :99  
                                              
                                              
                                              

The character | is the Boolean operator OR. So in our example, the query can be read as “… crop equals oats OR crop equals buckwheat”. Had we used the AND operator, &, instead as in Crop == "Oats" & Crop == "Buckwheat" the output would have returned zero rows since a Crop value cannot be both Oats AND Buckwheat.

We can expand this query by limiting our output to the years 2005 to 2010

library(dplyr)
dat.query3 <- filter(dat, Crop == "Oats" | Crop == "Buckwheat", 
                          Country == "Canada", 
                          Year >= 2005 & Year <= 2010)
summary(dat.query3)
                     Country             Crop                 Information      Year          Value        
 Canada                  :18   Oats        :12   Area harvested (Ha):9    Min.   :2005   Min.   :   2000  
 United States of America: 0   Buckwheat   : 6   Yield (Hg/Ha)      :9    1st Qu.:2006   1st Qu.:  11500  
                               Barley      : 0                            Median :2007   Median :  26615  
                               Canary seed : 0                            Mean   :2007   Mean   : 453810  
                               Grain, mixed: 0                            3rd Qu.:2008   3rd Qu.: 961825  
                               Maize       : 0                            Max.   :2010   Max.   :1815700  
                               (Other)     : 0                                                            
                                      Source 
 Calculated data                         :9  
 FAO data based on imputation methodology:0  
 FAO estimate                            :0  
 Official data                           :9  
                                             
                                             
                                             

Note the use of the AND Boolean operator (&) instead of the OR operator (|) for the Year query. We want the Year value to satisfy two criteria simultaneously: greater than or equal to 2005 AND less than or equal to 2010. Had we used the | operator, R would have returned all years since all year values satisfy at least one of the two criterion.

Sort rows by column value: arrange

You can sort a table based on a column’s values. For example, to sort dat by crop name type:

dat.sort1 <- arrange(dat, Crop)
head(dat.sort1)
                   Country   Crop         Information Year      Value          Source
1                   Canada Barley Area harvested (Ha) 2012 2060000.00   Official data
2                   Canada Barley       Yield (Hg/Ha) 2012   38894.66 Calculated data
3 United States of America Barley Area harvested (Ha) 2012 1312810.00   Official data
4 United States of America Barley       Yield (Hg/Ha) 2012   36533.24 Calculated data
5                   Canada Barley Area harvested (Ha) 2011 2364800.00   Official data
6                   Canada Barley       Yield (Hg/Ha) 2011   32796.43 Calculated data
tail(dat.sort1)
     Country      Crop         Information Year    Value                                   Source
1496  Canada Triticale Area harvested (Ha) 1991  1093.00                            Official data
1497  Canada Triticale       Yield (Hg/Ha) 1991 21957.91                          Calculated data
1498  Canada Triticale Area harvested (Ha) 1990  1074.00 FAO data based on imputation methodology
1499  Canada Triticale       Yield (Hg/Ha) 1990 26396.65                          Calculated data
1500  Canada Triticale Area harvested (Ha) 1989  1093.00                             FAO estimate
1501  Canada Triticale       Yield (Hg/Ha) 1989 21957.91                          Calculated data

By default, arrange sorts by ascending order. To sort by descending order, wrap the column name with the function desc(). For example, to sort the table by Crop in ascending order then by Year in descending order, type:

dat.sort2 <- arrange(dat, Crop, desc(Year))
head(dat.sort2)
                   Country   Crop         Information Year      Value          Source
1                   Canada Barley Area harvested (Ha) 2012 2060000.00   Official data
2                   Canada Barley       Yield (Hg/Ha) 2012   38894.66 Calculated data
3 United States of America Barley Area harvested (Ha) 2012 1312810.00   Official data
4 United States of America Barley       Yield (Hg/Ha) 2012   36533.24 Calculated data
5                   Canada Barley Area harvested (Ha) 2011 2364800.00   Official data
6                   Canada Barley       Yield (Hg/Ha) 2011   32796.43 Calculated data
tail(dat.sort2)
     Country      Crop         Information Year    Value                                   Source
1496  Canada Triticale Area harvested (Ha) 1991  1093.00                            Official data
1497  Canada Triticale       Yield (Hg/Ha) 1991 21957.91                          Calculated data
1498  Canada Triticale Area harvested (Ha) 1990  1074.00 FAO data based on imputation methodology
1499  Canada Triticale       Yield (Hg/Ha) 1990 26396.65                          Calculated data
1500  Canada Triticale Area harvested (Ha) 1989  1093.00                             FAO estimate
1501  Canada Triticale       Yield (Hg/Ha) 1989 21957.91                          Calculated data

Subset by column: select

You can grab a subset of the table by column(s). To extract the columns Crop, Year and Value, type:

dat.subcol <- select(dat, Crop, Year, Value)
head(dat.subcol, 2)
    Crop Year      Value
1 Barley 2012 2060000.00
2 Barley 2012   38894.66

If you want all columns other than Crop, Year and Value, add the negative - symbol before the column name:

dat.subcol <- select(dat, -Crop, -Year, -Value)
head(dat.subcol, 2)
  Country         Information          Source
1  Canada Area harvested (Ha)   Official data
2  Canada       Yield (Hg/Ha) Calculated data

Creating and/or calculating column values: mutate

You can add columns (and compute their values) using the mutate function. For example, to add a column Ctr_abbr and assign it the abbreviated values CAN for Canada and USA for the United States of America based on the values in column Country type:

dat.extended <- mutate(dat, Ctr_abbr = ifelse(Country == "Canada", "CAN", "USA"))
head(dat.extended,3)
  Country      Crop         Information Year      Value          Source Ctr_abbr
1  Canada    Barley Area harvested (Ha) 2012 2060000.00   Official data      CAN
2  Canada    Barley       Yield (Hg/Ha) 2012   38894.66 Calculated data      CAN
3  Canada Buckwheat Area harvested (Ha) 2012       0.00    FAO estimate      CAN
tail(dat.extended,3)
                      Country    Crop         Information Year      Value          Source Ctr_abbr
1499 United States of America     Rye       Yield (Hg/Ha) 1961   11121.79 Calculated data      USA
1500 United States of America Sorghum Area harvested (Ha) 1961 4445000.00   Official data      USA
1501 United States of America Sorghum       Yield (Hg/Ha) 1961   27442.07 Calculated data      USA

Here, we make use of an embedded function, ifelse, which performs a conditional operation: if the Country value is Canada return CAN if not, return USA.

Note that if you wish to rename a column, you can use the rename() function instead of mutate.

You can also use mutate to recompute column values. For example, to replace the Country column values with CAN or USA type:

dat.overwrite <- mutate(dat, Country = ifelse(Country == "Canada", "CAN", "USA"))
head(dat.overwrite,3)
  Country      Crop         Information Year      Value          Source
1     CAN    Barley Area harvested (Ha) 2012 2060000.00   Official data
2     CAN    Barley       Yield (Hg/Ha) 2012   38894.66 Calculated data
3     CAN Buckwheat Area harvested (Ha) 2012       0.00    FAO estimate
tail(dat.overwrite,3)
     Country    Crop         Information Year      Value          Source
1499     USA     Rye       Yield (Hg/Ha) 1961   11121.79 Calculated data
1500     USA Sorghum Area harvested (Ha) 1961 4445000.00   Official data
1501     USA Sorghum       Yield (Hg/Ha) 1961   27442.07 Calculated data

Creating and/or calculating values across muliple columns: mutate_at

You might find yourself wanting to apply a same set of mutate operations across multiple variables. For example, given the following sample dataset,

met <- data.frame(Wind = c(3.4, 5.0, 99, 4.1, 1.5),
                  Dir  = c(181, 220, 15,  15,  99 ),
                  Prec = c(99 , 0.5,  0,  99,  99))
met
  Wind Dir Prec
1  3.4 181 99.0
2  5.0 220  0.5
3 99.0  15  0.0
4  4.1  15 99.0
5  1.5  99 99.0

where the value 99 is a placeholder for a missing value for the variables Wind and Prec but a valid value for Dir, we want to replace all missing values with NA. We could either create two mutate operations as in,

met2 <- mutate( met, Wind = ifelse(Wind == 99, NA, Wind),
                     Prec = ifelse(Prec == 99, NA, Prec))
met2
  Wind Dir Prec
1  3.4 181   NA
2  5.0 220  0.5
3   NA  15  0.0
4  4.1  15   NA
5  1.5  99   NA

or, we could reduce the separate mutate operations into a single mutate_at operation,

met2 <- mutate_at( met, vars(Wind, Prec),
                        funs(ifelse( . == 99, NA, .)))
met2
  Wind Dir Prec
1  3.4 181   NA
2  5.0 220  0.5
3   NA  15  0.0
4  4.1  15   NA
5  1.5  99   NA

The dots . in the funs() function are placeholders for each column listed in the vars() function.

Summarize columns: summarise

You can summarize (or “collapse”) one or more columns using the summarise function. For instance, to get the minimum and maximum years from the Year column, type:

summarise(dat, yr_min = min(Year), yr_max=max(Year))
  yr_min yr_max
1   1961   2012

Combining table manipulation functions using %>%

In most cases, you will find yourself wanting to combine several of dplyr’s data manipulation functions. For example,

dat.yield  <- filter(dat, Information == "Yield (Hg/Ha)", 
                          Crop == "Oats",
                          Year == 2012)
dat.rename <- mutate(dat.yield, Country = ifelse(Country == "Canada", "CAN", "USA"))
dat.final  <- select(dat.rename, Country, Value)

head(dat.final, 3)
  Country    Value
1     CAN 24954.79
2     USA 21974.70

The downside to this approach is the creation of several intermediate objects (e.g. dat.yield and dat.rename). This can make the workflow difficult to follow and clutter your R session with needless intermediate objects.

Another approach to combining dplyr operations is to use the piping operator ,%>%, which daisy chains dplyr operations. So our previous workflow could look like:

dat.final <- dat %>%
  filter(Information == "Yield (Hg/Ha)", 
         Crop == "Oats",
         Year == 2012)  %>% 
  mutate(Country = ifelse(Country == "Canada", "CAN", "USA")) %>%
  select(Country, Value)

head(dat.final, 3)
  Country    Value
1     CAN 24954.79
2     USA 21974.70

The chunk of code can be read as “… with the dat table, filter by …, then mutate …., then select …” with the result from one operation being passed on to the next using the %>% operator. Note that the filter, mutate and select functions do not include the data table name making the chunk of code less cluttered and easier to read. The input data table dat appears just once at the beginning of the pipe.

Conditional statements

The base ifelse

Conditional statements are used when you want to create an output value that is conditioned on an evaluation. For example, if you want to output a value of 1 if an input value is less than 23 and a value of 0 otherwise, you can make use of the ifelse function as follows:

x <- c(12,102, 43, 20, 90, 0, 12, 6)
ifelse(x < 23, 1, 0)
[1] 1 0 0 1 0 1 1 1

The base ifelse function works as expected when the input/output values are numeric or character, but does not work as expected when applied to factors or dates. For example, if you wish to replace one factor level with another, the following example will not return the expected output.

x <- as.factor( c("apple", "banana", "banana", "pear", "apple"))
ifelse(x == "pear", "apple", x)
[1] "1"     "2"     "2"     "apple" "1"    

The output is a character representation of the level number (recall that factors encode level values as numbers behind the scenes, i.e. apple =1, banana=2, etc…). Likewise, if you wish to replace an erroneous date with a missing value you will get:

library(lubridate)
y <- mdy("1/23/2016", "3/2/2016", "12/1/1901", "11/23/2016")
ifelse( year(y) != 2016, NA, y)
[1] 16823 16862    NA 17128

Here, ifelse converts the date object to its internal numeric representation as number of days since 1970.

If you want to ensure that the data type is preserved, a safer alternative is to use dplyr’s if_else function.

dplyr’s if_else

The if_else function (note the underscore _), will preserve data type but does so strictly. For example, the following code will return an error:

if_else( year(y) != 2016, NA, y)
Error: `false` must be type logical, not double

The output data types (NA and y) are not treated the same (by seeing NA, the function assumes that the output must be logical which y is not). The solution is to force NA as a date object by wrapping it with a date function like mdy(NA):

if_else( year(y) != 2016, mdy(NA), y)
[1] "2016-01-23" "2016-03-02" NA           "2016-11-23"

Likewise, if the input vector is of type character, you need to ensure that all output values are characters too.

y <- c("apple", "banana", "banana", "pear", "apple")
if_else( y == "banana", as.character(NA), y)
[1] "apple" NA      NA      "pear"  "apple"

Recall from the Week 02 lecture notes that R has several NA reserved words for different data types (e.g. NA_character_ and NA_integer_ to name a few). So the last chunk of code could have been written as,

if_else( y == "banana", NA_character_, y)
[1] "apple" NA      NA      "pear"  "apple"

Note that there is no reserved word for NA date types in which case you would coerce NA to date using built-in functions like as.Date(NA) or lubridate functions like mdy(NA) as demonstrated in an earlier example.

Recoding factors using recode

When working with factors, however, if_else (as of dplyr version 0.7) will produce the following error:

if_else(x == "pear", "apple", x)
Error: `false` must be type character, not integer

R sees apple as a character and x as a number (i.e. the numeric representation of the factor level). A better option for recoding factors is to use dplyr’s recode function:

recode(x , "pear" = "apple")
[1] apple  banana banana apple  apple 
Levels: apple banana

You can recode more than one factor level. In fact, you can even introduce new levels in the recoding scheme:

recode(x , "pear" = "apple", "banana" = "pineapple" )
[1] apple     pineapple pineapple apple     apple    
Levels: apple pineapple

Note that as with if_else and case_when, recode is strict about preserving data types. So if you want to recode a level to NA make sure to use NA_character_ or as.character(NA).

Replacing NA factor levels

One operation you cannot perform with recode is converting an NA level to another level. For example, the following will generate an error message:

x[2] <- NA
x
[1] apple  <NA>   banana pear   apple 
Levels: apple banana pear
recode(x, NA = "other")
Error: unexpected '=' in "recode(x, NA ="

The simplest solution is to make use of a specialized factor package called forcats and its function, fct_explicit_na.

library(forcats)
x <- fct_explicit_na(x, na_level = "Other")
x
[1] apple  Other  banana pear   apple 
Levels: apple banana pear Other

Changing values based on multiple conditions: case_when

ifelse and if_else work great when a single set of conditions is to be satisfied. But if multiple sets of conditions are to be tested, nested if/else statements become cumbersome and are prone to clerical error. The following code highlights an example of nested if/else statements.

z <- c("deg","F","C", "Temperature")
if_else( z == "deg", "Degree", if_else(z == "F", "Fahrenheit", "Celsius"))
[1] "Degree"     "Fahrenheit" "Celsius"    "Celsius"   

A better alternative is to use case_when.

case_when(z == "deg" ~ "Degree",
          z == "F" ~ "Fahrenheit",
          z == "C" ~ "Celsius",
          TRUE ~ z)
[1] "Degree"      "Fahrenheit"  "Celsius"     "Temperature"

The last parameter, TRUE ~ z instructs R to keep all other values as is (not doing so would return NA values).

As with the if_else function, case_when is strict about data type, so if the input is a character vector it expects the output to be character as well. So if you are to assign NA values in your output, avoid the following mistake.

case_when(z == "deg" ~ NA,  # NA is a logical type, this wil return an error
          z == "F" ~ "Fahrenheit",
          z == "C" ~ "Celsius",
          TRUE ~ z)
Error: must be type logical, not character

As was the case with the earlier if_else example, case_when requires a character NA type. We can use either as.character(NA) or NA_character_.

case_when(z == "deg" ~ NA_character_,
          z == "F" ~ "Fahrenheit",
          z == "C" ~ "Celsius",
          TRUE ~ z)
[1] NA            "Fahrenheit"  "Celsius"     "Temperature"

Note that ifelse, if_else, recode and case_when can all be used inside of a mutate function. For example, to replace Canada and United States of America in variable Country with CAN and USA respectively and to create a new variable called Type which will take on the values of 1, 2 or 3 depending on the values in variable Source, type the following:

dat1 <- dat %>% 
  mutate(Country = recode(Country, "Canada" = "CAN",
                                   "United States of America" = "USA"),
         Type = case_when(Source == "Calculated data" ~ 1,
                          Source == "Official data" ~ 2,
                          TRUE ~ 3)) 
head(dat1)   
  Country         Crop         Information Year      Value          Source Type
1     CAN       Barley Area harvested (Ha) 2012 2060000.00   Official data    2
2     CAN       Barley       Yield (Hg/Ha) 2012   38894.66 Calculated data    1
3     CAN    Buckwheat Area harvested (Ha) 2012       0.00    FAO estimate    3
4     CAN  Canary seed Area harvested (Ha) 2012  101900.00   Official data    2
5     CAN  Canary seed       Yield (Hg/Ha) 2012   12161.92 Calculated data    1
6     CAN Grain, mixed Area harvested (Ha) 2012   57900.00   Official data    2

Miscellaneous

Outputting a vector instead of a table using pull

Piping operations will normally output a table, even if a single value is returned. For example, the following summarization operation returns the total oats yield as a data table:

oats <- dat %>% 
  filter(Crop == "Oats",
         Information == "Yield (Hg/Ha)") %>% 
  summarise(Oats_sum = sum(Value))
oats
  Oats_sum
1  2169334
class(oats)
[1] "data.frame"

There may be times when you want the output as a vector element and not a data table. To output a vector, use the pull() function.

oats <- dat %>% 
  filter(Crop == "Oats",
         Information == "Yield (Hg/Ha)") %>% 
  summarise(Oats_sum = sum(Value)) %>% 
  pull()
oats
[1] 2169334
class(oats)
[1] "numeric"

The pull function can also be used to convert a data table column to a multi-element vector, e.g.:

# This outputs a one column table
yield <- dat %>% 
  filter(Crop == "Oats",
         Information == "Yield (Hg/Ha)") %>% 
  select(Value)

head(yield)
     Value
1 24954.79
2 21974.70
3 29109.36
4 20492.37
5 27364.53
6 23056.62
class(yield)
[1] "data.frame"
# This outputs a multi-element vector
yield <- dat %>% 
  filter(Crop == "Oats",
         Information == "Yield (Hg/Ha)") %>% 
  pull(Value)

head(yield)
[1] 24954.79 21974.70 29109.36 20492.37 27364.53 23056.62
class(yield)
[1] "numeric"