9  Manipulating data tables with dplyr

dplyr lubridate forcats stringr
1.1.4 1.9.3 1.0.0 1.5.1

The data file FAO_grains_NA.csv will be used in this exercise. This dataset consists of grain yield and harvest year by North American country. The dataset was downloaded from http://faostat3.fao.org/ in June of 2014.

Run the following line to load the FAO data file into your current R session.

dat <- read.csv("http://mgimond.github.io/ES218/Data/FAO_grains_NA.csv", header=TRUE)

Before tackling the examples that follow, make sure to load the dplyr package.

library(dplyr)

9.1 The dplyr basics

The basic set of R tools can accomplish many data table queries, but the syntax can be overwhelming and verbose. The dplyr package offers some nifty and simple querying functions. Some of dplyr’s key data manipulation functions are summarized in the following table:

dplyr function Description
filter() Subset by row values
arrange() Sort rows by column values
select() Subset columns
mutate() Add columns
summarise() Summarize columns

Note that all of these functions take as first argument the data table name except when used in a piping operation (pipes are discussed later in this chapter). For example:

When used alone, dataframe dat is the first argument inside the dplyr function.

dat2 <- select(dat, Crop)

When used in a pipe, dataframe dat is outside of the dplyr function.

dat2 <- dat %>% select(Crop)

If you are using R 4.1 or greater, you can make use of the native pipe.

dat2 <- dat |> select(Crop)

9.1.1 filter: Subset by rows

Tables can be subsetted by rows based on column values. For example, we may wish to grab all rows associated with the value Oats in the Crop column:

dat.query1 <- filter(dat, Crop == "Oats")

We can use the unique() function to identify all unique values in the Crop column. This should return a single unique value: Oats.

unique(dat.query1$Crop)
[1] "Oats"

Note that R is case sensitive, so make sure that you respect each letter’s case (i.e. upper or lower).

We can expand our query by including both Oats, Buckwheat and limiting the Country column to Canada.

dat.query2 <- filter(dat, Crop == "Oats" | Crop == "Buckwheat", 
                          Country == "Canada")
unique(dat.query1$Crop)
[1] "Oats"
unique(dat.query1$Country)
[1] "Canada"                   "United States of America"

The character | is the Boolean operator OR. So in our example, the query can be read as “… crop equals oats OR crop equals buckwheat”. Had we used the AND operator, & as in Crop == "Oats" & Crop == "Buckwheat", the output would have returned zero rows since a Crop value cannot be both Oats AND Buckwheat.

We can expand this query by limiting our output to the years 2005 to 2010

dat.query3 <- filter(dat, Crop == "Oats" | Crop == "Buckwheat", 
                          Country == "Canada", 
                          Year >= 2005 & Year <= 2010)
unique(dat.query3$Crop)
[1] "Oats"      "Buckwheat"
unique(dat.query3$Country)
[1] "Canada"
range(dat.query3$Year)
[1] 2005 2010

Note the use of the AND Boolean operator (&) instead of the OR operator (|) for the Year query. We want the Year value to satisfy two criteria simultaneously: greater than or equal to 2005 AND less than or equal to 2010. Had we used the | operator, R would have returned all years since all year values satisfy at least one of the two critera.

9.1.2 arrange: Sort rows by column value

You can sort a table based on a column’s values using the arrange function. For example, to sort dat by crop name type:

dat.sort1 <- arrange(dat, Crop)
head(dat.sort1)
                   Country   Crop         Information Year      Value
1                   Canada Barley Area harvested (Ha) 2012 2060000.00
2                   Canada Barley       Yield (Hg/Ha) 2012   38894.66
3 United States of America Barley Area harvested (Ha) 2012 1312810.00
4 United States of America Barley       Yield (Hg/Ha) 2012   36533.24
5                   Canada Barley Area harvested (Ha) 2011 2364800.00
6                   Canada Barley       Yield (Hg/Ha) 2011   32796.43
           Source
1   Official data
2 Calculated data
3   Official data
4 Calculated data
5   Official data
6 Calculated data
tail(dat.sort1)
     Country      Crop         Information Year    Value
1496  Canada Triticale Area harvested (Ha) 1991  1093.00
1497  Canada Triticale       Yield (Hg/Ha) 1991 21957.91
1498  Canada Triticale Area harvested (Ha) 1990  1074.00
1499  Canada Triticale       Yield (Hg/Ha) 1990 26396.65
1500  Canada Triticale Area harvested (Ha) 1989  1093.00
1501  Canada Triticale       Yield (Hg/Ha) 1989 21957.91
                                       Source
1496                            Official data
1497                          Calculated data
1498 FAO data based on imputation methodology
1499                          Calculated data
1500                             FAO estimate
1501                          Calculated data

By default, arrange sorts by ascending order. To sort by descending order, wrap the column name with the function desc(). For example, to sort the table by Crop in ascending order, then by Year in descending order, type:

dat.sort2 <- arrange(dat, Crop, desc(Year))
head(dat.sort2)
                   Country   Crop         Information Year      Value
1                   Canada Barley Area harvested (Ha) 2012 2060000.00
2                   Canada Barley       Yield (Hg/Ha) 2012   38894.66
3 United States of America Barley Area harvested (Ha) 2012 1312810.00
4 United States of America Barley       Yield (Hg/Ha) 2012   36533.24
5                   Canada Barley Area harvested (Ha) 2011 2364800.00
6                   Canada Barley       Yield (Hg/Ha) 2011   32796.43
           Source
1   Official data
2 Calculated data
3   Official data
4 Calculated data
5   Official data
6 Calculated data
tail(dat.sort2)
     Country      Crop         Information Year    Value
1496  Canada Triticale Area harvested (Ha) 1991  1093.00
1497  Canada Triticale       Yield (Hg/Ha) 1991 21957.91
1498  Canada Triticale Area harvested (Ha) 1990  1074.00
1499  Canada Triticale       Yield (Hg/Ha) 1990 26396.65
1500  Canada Triticale Area harvested (Ha) 1989  1093.00
1501  Canada Triticale       Yield (Hg/Ha) 1989 21957.91
                                       Source
1496                            Official data
1497                          Calculated data
1498 FAO data based on imputation methodology
1499                          Calculated data
1500                             FAO estimate
1501                          Calculated data

9.1.3 select: Subset by column

You can subset a table by column(s) using the select function. To extract the columns Crop, Year and Value, type:

dat.subcol <- select(dat, Crop, Year, Value)
head(dat.subcol, 2)
    Crop Year      Value
1 Barley 2012 2060000.00
2 Barley 2012   38894.66

If you want all columns other than Crop, Year and Value, add the negative - symbol before the column name:

dat.subcol <- select(dat, -Crop, -Year, -Value)
head(dat.subcol, 2)
  Country         Information          Source
1  Canada Area harvested (Ha)   Official data
2  Canada       Yield (Hg/Ha) Calculated data

9.1.4 mutate: Creating and/or calculating column values

You can add columns (and compute their values) using the mutate function. For example, to add a column Ctr_abbr and assign it the abbreviated values CAN for Canada and USA for the United States of America based on the values in column Country type:

dat.extended <- mutate(dat, Ctr_abbr = ifelse(Country == "Canada", "CAN", "USA"))
head(dat.extended,3)
  Country      Crop         Information Year      Value          Source
1  Canada    Barley Area harvested (Ha) 2012 2060000.00   Official data
2  Canada    Barley       Yield (Hg/Ha) 2012   38894.66 Calculated data
3  Canada Buckwheat Area harvested (Ha) 2012       0.00    FAO estimate
  Ctr_abbr
1      CAN
2      CAN
3      CAN
tail(dat.extended,3)
                      Country    Crop         Information Year      Value
1499 United States of America     Rye       Yield (Hg/Ha) 1961   11121.79
1500 United States of America Sorghum Area harvested (Ha) 1961 4445000.00
1501 United States of America Sorghum       Yield (Hg/Ha) 1961   27442.07
              Source Ctr_abbr
1499 Calculated data      USA
1500   Official data      USA
1501 Calculated data      USA

Here, we make use of an embedded function, ifelse, which performs a conditional operation: if the Country value is Canada return CAN if not, return USA. ifelse is covered in section 9.3.

Note that if you wish to rename a column, you can use the rename() function instead of mutate.

You can also use mutate to recompute column values. For example, to replace the Country column values with CAN or USA type:

dat.overwrite <- mutate(dat, Country = ifelse(Country == "Canada", "CAN", "USA"))
head(dat.overwrite, 3)
  Country      Crop         Information Year      Value          Source
1     CAN    Barley Area harvested (Ha) 2012 2060000.00   Official data
2     CAN    Barley       Yield (Hg/Ha) 2012   38894.66 Calculated data
3     CAN Buckwheat Area harvested (Ha) 2012       0.00    FAO estimate
tail(dat.overwrite, 3)
     Country    Crop         Information Year      Value          Source
1499     USA     Rye       Yield (Hg/Ha) 1961   11121.79 Calculated data
1500     USA Sorghum Area harvested (Ha) 1961 4445000.00   Official data
1501     USA Sorghum       Yield (Hg/Ha) 1961   27442.07 Calculated data

9.1.5 mutate across muliple columns

You might find yourself wanting to apply a same set of mutate operations across multiple variables. For example, given the following sample dataset,

met <- data.frame(Wind = c(3.4, 5.0, 99, 4.1, 1.5),
                  Dir  = c(181, 220, 15,  15,  99 ),
                  Prec = c(99 , 0.5,  0,  99,  99))
met
  Wind Dir Prec
1  3.4 181 99.0
2  5.0 220  0.5
3 99.0  15  0.0
4  4.1  15 99.0
5  1.5  99 99.0

where the value 99 is a placeholder for a missing value for the variables Wind and Prec but a valid value for Dir, we want to replace all missing values with NA. We could either create two mutate operations as in,

met %>%  mutate(Wind = ifelse(Wind == 99, NA, Wind),
                Prec = ifelse(Prec == 99, NA, Prec))
  Wind Dir Prec
1  3.4 181   NA
2  5.0 220  0.5
3   NA  15  0.0
4  4.1  15   NA
5  1.5  99   NA

or, we could reduce the separate mutate operations into a single operation by adding an across() function to its argument.

met %>% mutate(across( c(Wind, Prec),
                       ~ ifelse( . == 99, NA, .)))
  Wind Dir Prec
1  3.4 181   NA
2  5.0 220  0.5
3   NA  15  0.0
4  4.1  15   NA
5  1.5  99   NA

across takes two arguments: the columns the mutate operation is to operate on, and the mutate operation to perform on these columns. Here, the tilde ~ can be interpreted as “to the listed columns, apply the following function …”. The dot . in the ifelse function is a placeholder for each column listed in across’s first argument.

9.1.6 summarise: Summarize columns

You can summarize (or “collapse”) one or more columns using the summarise function. For instance, to get the minimum and maximum years from the Year column, type:

summarise(dat, yr_min = min(Year), yr_max=max(Year))
  yr_min yr_max
1   1961   2012

9.2 Combining data manipulation functions using the pipe %>%

In most cases, you will find yourself wanting to combine several of dplyr’s data manipulation functions. For example,

dat.yield  <- filter(dat, Information == "Yield (Hg/Ha)", 
                          Crop == "Oats",
                          Year == 2012)
dat.rename <- mutate(dat.yield, Country = ifelse(Country == "Canada", "CAN", "USA"))
dat.final  <- select(dat.rename, Country, Value)

head(dat.final, 3)
  Country    Value
1     CAN 24954.79
2     USA 21974.70

The downside to this approach is the creation of several intermediate objects (e.g. dat.yield and dat.rename). This can make the workflow difficult to follow and clutter your R session with needless intermediate objects.

Another approach to combining dplyr operations is to use the piping operator ,%>%, which daisy chains dplyr operations. So our previous workflow could look like:

dat.final <- dat %>%
  filter(Information == "Yield (Hg/Ha)", 
         Crop == "Oats",
         Year == 2012)  %>% 
  mutate(Country = ifelse(Country == "Canada", "CAN", "USA")) %>%
  select(Country, Value)

head(dat.final, 3)
  Country    Value
1     CAN 24954.79
2     USA 21974.70

The chunk of code can be read as “… with the dat table, filter by …, then mutate …., then select …” with the result from one operation being passed on to the next using the %>% operator. Note that the filter, mutate and select functions do not include the data table name making the chunk of code less cluttered and easier to read. The input data table dat appears just once at the beginning of the pipe.

9.2.1 R has a native pipe too

R has recently (as of version 4.1) added its own native pipe to its base function. Its infix operator is written as |>. In most code chunks covered in these tutorials, you can substitute %>% with |>. For example, you can write the previous code chunk as:

dat.final <- dat  |> 
  filter(Information == "Yield (Hg/Ha)", 
         Crop == "Oats",
         Year == 2012)   |>  
  mutate(Country = ifelse(Country == "Canada", "CAN", "USA"))  |> 
  select(Country, Value)

There are, however, a few subtle differences between the two. A deeper dive in how they differ can be found here.

In this course, we’ll stick with the %>% operator given that |> is new and is not yet as widely adopted as %>%.

  

9.2.2 Updating a factor’s levels in a dataframe

Subsetting a dataframe by row might result in a factor losing all values associated with a level. You learned in section 3.3.1.2 that the droplevels function can be used to remove any levels no longer present in the factor. droplevels can also be used in a piping operation, however, the function will drop levels for all factors in the dataframe–this might not be the intended goal.

If you want to selectively drop the levels for a specific set of factors in a dataframe, you might want to make use of the fct_drop function from the forcats package. For example, let’s first create a subset of the built-in mtcars dataset and create factors from a couple of its variables. We’ll use the base methods to tackle this task.

mt <- mtcars[ , c("cyl", "gear")]
mt$cyl  <- factor(mtcars$cyl)
mt$gear <- factor(mtcars$gear)

Now, let’s filter by cyl.

mt_sub <- mt %>% 
   filter(cyl == 8)

Exploring mt_sub’s summary, we see that some levels have no matching values in the factors:

summary(mt_sub)
 cyl    gear  
 4: 0   3:12  
 6: 0   4: 0  
 8:14   5: 2  

Applying droplevels to the dataframe will remove those levels from both factors;

mt_sub <- mt %>% 
   filter(cyl == 8) %>% 
   droplevels()

summary(mt_sub)
 cyl    gear  
 8:14   3:12  
        5: 2  

Using fct_drop controls which factor(s) you want to clean up the levels for. Note, however, that fct_drop must be used inside of a mutate function.

library(forcats) 

mt_sub <- mt %>% 
   filter(cyl == 8) %>% 
   mutate( cyl = fct_drop(cyl))

summary(mt_sub)
 cyl    gear  
 8:14   3:12  
        4: 0  
        5: 2  

9.3 Conditional statements

  

9.3.1 The base ifelse

Conditional statements are used when you want to create an output value that is conditioned on an evaluation. For example, if you want to output a value of 1 if an input value is less than 23 and a value of 0 otherwise, you can make use of the ifelse function as follows:

x <- c(12,102, 43, 20, 90, 0, 12, 6)
ifelse(x < 23, 1, 0)
[1] 1 0 0 1 0 1 1 1

ifelse takes three arguments:

  • The condition to evaluate (x < 23 in the above example);
  • The value to output if the condition is TRUE (1 in our example);
  • The value to output if the condition is FALSE (0 in our example).

The base ifelse function works as expected when the input/output values are numeric or character, but it does not work as expected when applied to factors or dates. For example, if you wish to replace one factor level with another, the following example will not return the expected output.

x <- as.factor( c("apple", "banana", "banana", "pear", "apple"))
ifelse(x == "pear", "apple", x)
[1] "1"     "2"     "2"     "apple" "1"    

The output is a character representation of the level number (recall that factors encode level values as numbers behind the scenes, i.e. apple =1, banana=2, etc…). Likewise, if you wish to replace an erroneous date you will get:

library(lubridate)
y <- mdy("1/23/2016", "3/2/2016", "12/1/1901", "11/23/2016")
ifelse( y == mdy("12/1/1901"), mdy("12/1/2016"), y)
[1] 16823 16862 17136 17128

Here, ifelse converts the date object to its internal numeric representation as number of days since 1970.

9.3.2 dplyr’s if_else

ifelse does not preserve the attributes that might be present in a vector. In other words, it will strip away the vector’s class. A safer alternative is to use dplyr’s if_else function.

Reworking the above examples:

if_else( y == mdy("12/1/1901"), mdy("12/1/2016"), y)
[1] "2016-01-23" "2016-03-02" "2016-12-01" "2016-11-23"

The date class is preserved. Now let’s check the output of a factor.

if_else(x == "pear", "apple", x)
[1] "apple"  "banana" "banana" "apple"  "apple" 

Note that when working with factors, however, if_else will strip the factor class of an input factor. But, instead of returning the factor’s underlying integer values, it outputs the associated levels as a character data type.

The workaround is to convert the if_else output to a factor.

factor(if_else(x == "pear", "apple", x))
[1] apple  banana banana apple  apple 
Levels: apple banana

If you need to explicitly define the levels, add the levels = c("pear",...).

NOTE: dplyr offers the recode function that preserves factors however, this function is being superseded according to the documentation as of version 1.1.4.

9.3.3 Changing values based on multiple conditions: case_when

ifelse and if_else work great when a single set of conditions is to be satisfied. But, if multiple sets of conditions are to be evaluated, nested if/else statements become cumbersome and are prone to clerical error. The following code highlights an example of nested if/else statements.

unit <- c("F","F","C", "K")
if_else( unit == "C", "Celsius", if_else(unit == "F", "Fahrenheit", "Kelvin"))
[1] "Fahrenheit" "Fahrenheit" "Celsius"    "Kelvin"    

A simpler solution is to use the case_when function.

case_when(unit == "C" ~ "Celsius",
          unit == "F" ~ "Fahrenheit",
          unit == "K" ~ "Kelvin")
[1] "Fahrenheit" "Fahrenheit" "Celsius"    "Kelvin"    

case_when can aso be used for more complex operations. For example, given two vectors, unit and temp, we would like to convert all temp values to Fahrenheit by applying a temperature conversion dependent on the unit value.

temp <- c(45.2, 56.3, 11.0, 285)
case_when(unit == "F" ~ temp,
          unit == "C" ~ (temp * 9/5) + 32,
          TRUE ~ (temp - 273.15) * 9/5 + 32)
[1] 45.20 56.30 51.80 53.33

The last argument, TRUE ~, applies to all conditions not satisfied by the previous two conditions (otherwise, not doing so would return NA values by default). You only need to add a TRUE ~ condition if you know that all previously listed conditions may not cover all possible outcomes. Here, we know that some observations are associated with unit == "K" yet that condition is not explicitly defined in the case_when arguments. We could have, of course, added the unit == "K" condition to the above code chunk thus alleviating the need for the TRUE ~ condition.

Note that the order in which these conditions are listed matters since evaluation stops at the first TRUE outcome encountered. So, had the last condition been moved to the top of the stack, all temp values would be assigned the first conversion option.

# What not to do ...
case_when(TRUE ~ (temp - 273.15) * 9/5 + 32,
          unit == "F" ~ temp,
          unit == "C" ~ (temp * 9/5) + 32)
[1] -378.31 -358.33 -439.87   53.33

Note that case_when can also be used inside of a mutate function. For example, to replace Canada and United States of America in variable Country with CAN and USA respectively and to create a new variable called Type which will take on the values of 1, 2 or 3 depending on the values in variable Source, type the following:

dat1 <- dat %>% 
  mutate(Country = case_when(Country == "Canada" ~ "CAN",
                             Country == "United States of America" ~ "USA"),
         Type = case_when(Source == "Calculated data" ~ 1,
                          Source == "Official data" ~ 2,
                          TRUE ~ 3)) 
head(dat1)   
  Country         Crop         Information Year      Value          Source Type
1     CAN       Barley Area harvested (Ha) 2012 2060000.00   Official data    2
2     CAN       Barley       Yield (Hg/Ha) 2012   38894.66 Calculated data    1
3     CAN    Buckwheat Area harvested (Ha) 2012       0.00    FAO estimate    3
4     CAN  Canary seed Area harvested (Ha) 2012  101900.00   Official data    2
5     CAN  Canary seed       Yield (Hg/Ha) 2012   12161.92 Calculated data    1
6     CAN Grain, mixed Area harvested (Ha) 2012   57900.00   Official data    2

  

9.4 Replacing values with NA

So far, we’ve used the ifelse or if_else functions to replace certain values with NA. dplyr offers the na_if() function to simplify the syntax. For example, to replace -999 with NA:

val <- c(-999, 6, -1, -999)
na_if( val , -999 )
[1] NA  6 -1 NA

Likewise, to replace all empty character values:

val <- c("ab", "", "A b", "  ")
na_if( val , "" )
[1] "ab"  NA    "A b" "  " 

na_if will also preserve the object’s class. For example:

x <- as.factor( c("apple", "walnut", "banana", "pear", "apple"))
na_if(x , "walnut")
[1] apple  <NA>   banana pear   apple 
Levels: apple banana pear walnut

But, note that it does not automatically drop the level being replaced with NA.

na_if also works with dates, but don’t forget to evaluate a date object with a date value. For example, to replace dates of 12/1/1901 with NA, we need to make a date object of that value. Here, we’ll make use of the mdy() function as in mdy("12/1/1901").

y <- mdy("1/23/2016", "3/2/2016", "12/1/1901", "11/23/2016")
na_if(y, mdy("12/1/1901"))
[1] "2016-01-23" "2016-03-02" NA           "2016-11-23"

To use na_if() in a piping operation, it needs to be embedded in a mutate() function. For example, to replace "Calculated data" with NA in the dat dataframe, type:

dat1 <- dat %>% 
    mutate(Source = na_if(Source, "Calculated data" ))
unique(dat1$Source)
[1] "Official data"                           
[2] NA                                        
[3] "FAO estimate"                            
[4] "FAO data based on imputation methodology"

9.5 Outputting a vector instead of a table using pull

Piping operations will output a table, even if a single value is returned. For example, the following summarization operation returns the total oats yield as a dataframe:

oats <- dat %>% 
  filter(Crop == "Oats",
         Information == "Yield (Hg/Ha)") %>% 
  summarise(Oats_sum = sum(Value))
oats
  Oats_sum
1  2169334
class(oats)
[1] "data.frame"

There may be times when you want to output a vector element and not a dataframe. To output a vector, use the pull() function.

oats <- dat %>% 
  filter(Crop == "Oats",
         Information == "Yield (Hg/Ha)") %>% 
  summarise(Oats_sum = sum(Value)) %>% 
  pull()
oats
[1] 2169334
class(oats)
[1] "numeric"

The pull function can also be used to explicitly define the column to extract. For example, to extract the Value column type:

# This outputs a multi-element vector
yield <- dat %>% 
  filter(Crop == "Oats",
         Information == "Yield (Hg/Ha)") %>% 
  pull(Value)
head(yield)
[1] 24954.79 21974.70 29109.36 20492.37 27364.53 23056.62
class(yield)
[1] "numeric"