dplyr | lubridate | forcats | stringr |
---|---|---|---|
1.1.4 | 1.9.3 | 1.0.0 | 1.5.1 |
9 Manipulating data tables with dplyr
The data file FAO_grains_NA.csv will be used in this exercise. This dataset consists of grain yield and harvest year by North American country. The dataset was downloaded from http://faostat3.fao.org/ in June of 2014.
Run the following line to load the FAO data file into your current R session.
<- read.csv("http://mgimond.github.io/ES218/Data/FAO_grains_NA.csv", header=TRUE) dat
Before tackling the examples that follow, make sure to load the dplyr
package.
library(dplyr)
9.1 The dplyr basics
The basic set of R tools can accomplish many data table queries, but the syntax can be overwhelming and verbose. The dplyr
package offers some nifty and simple querying functions. Some of dplyr
’s key data manipulation functions are summarized in the following table:
dplyr function |
Description |
---|---|
filter() |
Subset by row values |
arrange() |
Sort rows by column values |
select() |
Subset columns |
mutate() |
Add columns |
summarise() |
Summarize columns |
Note that all of these functions take as first argument the data table name except when used in a piping operation (pipes are discussed later in this chapter). For example:
When used alone, dataframe dat
is the first argument inside the dplyr
function.
<- select(dat, Crop) dat2
When used in a pipe, dataframe dat
is outside of the dplyr
function.
<- dat %>% select(Crop) dat2
If you are using R 4.1 or greater, you can make use of the native pipe.
<- dat |> select(Crop) dat2
9.1.1 filter
: Subset by rows
Tables can be subsetted by rows based on column values. For example, we may wish to grab all rows associated with the value Oats
in the Crop
column:
<- filter(dat, Crop == "Oats") dat.query1
We can use the unique()
function to identify all unique values in the Crop
column. This should return a single unique value: Oats
.
unique(dat.query1$Crop)
[1] "Oats"
Note that R is case sensitive, so make sure that you respect each letter’s case (i.e. upper or lower).
We can expand our query by including both Oats
, Buckwheat
and limiting the Country
column to Canada
.
<- filter(dat, Crop == "Oats" | Crop == "Buckwheat",
dat.query2 == "Canada") Country
unique(dat.query1$Crop)
[1] "Oats"
unique(dat.query1$Country)
[1] "Canada" "United States of America"
The character |
is the Boolean operator OR. So in our example, the query can be read as “… crop equals oats OR crop equals buckwheat”. Had we used the AND operator, &
as in Crop == "Oats" & Crop == "Buckwheat"
, the output would have returned zero rows since a Crop
value cannot be both Oats
AND Buckwheat
.
We can expand this query by limiting our output to the years 2005 to 2010
<- filter(dat, Crop == "Oats" | Crop == "Buckwheat",
dat.query3 == "Canada",
Country >= 2005 & Year <= 2010) Year
unique(dat.query3$Crop)
[1] "Oats" "Buckwheat"
unique(dat.query3$Country)
[1] "Canada"
range(dat.query3$Year)
[1] 2005 2010
Note the use of the AND Boolean operator (&
) instead of the OR operator (|
) for the Year
query. We want the Year
value to satisfy two criteria simultaneously: greater than or equal to 2005 AND less than or equal to 2010. Had we used the |
operator, R would have returned all years since all year values satisfy at least one of the two critera.
9.1.2 arrange
: Sort rows by column value
You can sort a table based on a column’s values using the arrange
function. For example, to sort dat
by crop name type:
<- arrange(dat, Crop)
dat.sort1 head(dat.sort1)
Country Crop Information Year Value
1 Canada Barley Area harvested (Ha) 2012 2060000.00
2 Canada Barley Yield (Hg/Ha) 2012 38894.66
3 United States of America Barley Area harvested (Ha) 2012 1312810.00
4 United States of America Barley Yield (Hg/Ha) 2012 36533.24
5 Canada Barley Area harvested (Ha) 2011 2364800.00
6 Canada Barley Yield (Hg/Ha) 2011 32796.43
Source
1 Official data
2 Calculated data
3 Official data
4 Calculated data
5 Official data
6 Calculated data
tail(dat.sort1)
Country Crop Information Year Value
1496 Canada Triticale Area harvested (Ha) 1991 1093.00
1497 Canada Triticale Yield (Hg/Ha) 1991 21957.91
1498 Canada Triticale Area harvested (Ha) 1990 1074.00
1499 Canada Triticale Yield (Hg/Ha) 1990 26396.65
1500 Canada Triticale Area harvested (Ha) 1989 1093.00
1501 Canada Triticale Yield (Hg/Ha) 1989 21957.91
Source
1496 Official data
1497 Calculated data
1498 FAO data based on imputation methodology
1499 Calculated data
1500 FAO estimate
1501 Calculated data
By default, arrange
sorts by ascending order. To sort by descending order, wrap the column name with the function desc()
. For example, to sort the table by Crop
in ascending order, then by Year
in descending order, type:
<- arrange(dat, Crop, desc(Year))
dat.sort2 head(dat.sort2)
Country Crop Information Year Value
1 Canada Barley Area harvested (Ha) 2012 2060000.00
2 Canada Barley Yield (Hg/Ha) 2012 38894.66
3 United States of America Barley Area harvested (Ha) 2012 1312810.00
4 United States of America Barley Yield (Hg/Ha) 2012 36533.24
5 Canada Barley Area harvested (Ha) 2011 2364800.00
6 Canada Barley Yield (Hg/Ha) 2011 32796.43
Source
1 Official data
2 Calculated data
3 Official data
4 Calculated data
5 Official data
6 Calculated data
tail(dat.sort2)
Country Crop Information Year Value
1496 Canada Triticale Area harvested (Ha) 1991 1093.00
1497 Canada Triticale Yield (Hg/Ha) 1991 21957.91
1498 Canada Triticale Area harvested (Ha) 1990 1074.00
1499 Canada Triticale Yield (Hg/Ha) 1990 26396.65
1500 Canada Triticale Area harvested (Ha) 1989 1093.00
1501 Canada Triticale Yield (Hg/Ha) 1989 21957.91
Source
1496 Official data
1497 Calculated data
1498 FAO data based on imputation methodology
1499 Calculated data
1500 FAO estimate
1501 Calculated data
9.1.3 select
: Subset by column
You can subset a table by column(s) using the select
function. To extract the columns Crop
, Year
and Value
, type:
<- select(dat, Crop, Year, Value)
dat.subcol head(dat.subcol, 2)
Crop Year Value
1 Barley 2012 2060000.00
2 Barley 2012 38894.66
If you want all columns other than Crop
, Year
and Value
, add the negative -
symbol before the column name:
<- select(dat, -Crop, -Year, -Value)
dat.subcol head(dat.subcol, 2)
Country Information Source
1 Canada Area harvested (Ha) Official data
2 Canada Yield (Hg/Ha) Calculated data
9.1.4 mutate
: Creating and/or calculating column values
You can add columns (and compute their values) using the mutate
function. For example, to add a column Ctr_abbr
and assign it the abbreviated values CAN
for Canada and USA
for the United States of America based on the values in column Country
type:
<- mutate(dat, Ctr_abbr = ifelse(Country == "Canada", "CAN", "USA"))
dat.extended head(dat.extended,3)
Country Crop Information Year Value Source
1 Canada Barley Area harvested (Ha) 2012 2060000.00 Official data
2 Canada Barley Yield (Hg/Ha) 2012 38894.66 Calculated data
3 Canada Buckwheat Area harvested (Ha) 2012 0.00 FAO estimate
Ctr_abbr
1 CAN
2 CAN
3 CAN
tail(dat.extended,3)
Country Crop Information Year Value
1499 United States of America Rye Yield (Hg/Ha) 1961 11121.79
1500 United States of America Sorghum Area harvested (Ha) 1961 4445000.00
1501 United States of America Sorghum Yield (Hg/Ha) 1961 27442.07
Source Ctr_abbr
1499 Calculated data USA
1500 Official data USA
1501 Calculated data USA
Here, we make use of an embedded function, ifelse
, which performs a conditional operation: if the Country
value is Canada
return CAN
if not, return USA
. ifelse
is covered in section 9.3.
Note that if you wish to rename a column, you can use the rename()
function instead of mutate
.
You can also use mutate
to recompute column values. For example, to replace the Country
column values with CAN
or USA
type:
<- mutate(dat, Country = ifelse(Country == "Canada", "CAN", "USA"))
dat.overwrite head(dat.overwrite, 3)
Country Crop Information Year Value Source
1 CAN Barley Area harvested (Ha) 2012 2060000.00 Official data
2 CAN Barley Yield (Hg/Ha) 2012 38894.66 Calculated data
3 CAN Buckwheat Area harvested (Ha) 2012 0.00 FAO estimate
tail(dat.overwrite, 3)
Country Crop Information Year Value Source
1499 USA Rye Yield (Hg/Ha) 1961 11121.79 Calculated data
1500 USA Sorghum Area harvested (Ha) 1961 4445000.00 Official data
1501 USA Sorghum Yield (Hg/Ha) 1961 27442.07 Calculated data
9.1.5 mutate
across muliple columns
You might find yourself wanting to apply a same set of mutate operations across multiple variables. For example, given the following sample dataset,
<- data.frame(Wind = c(3.4, 5.0, 99, 4.1, 1.5),
met Dir = c(181, 220, 15, 15, 99 ),
Prec = c(99 , 0.5, 0, 99, 99))
met
Wind Dir Prec
1 3.4 181 99.0
2 5.0 220 0.5
3 99.0 15 0.0
4 4.1 15 99.0
5 1.5 99 99.0
where the value 99
is a placeholder for a missing value for the variables Wind
and Prec
but a valid value for Dir
, we want to replace all missing values with NA
. We could either create two mutate operations as in,
%>% mutate(Wind = ifelse(Wind == 99, NA, Wind),
met Prec = ifelse(Prec == 99, NA, Prec))
Wind Dir Prec
1 3.4 181 NA
2 5.0 220 0.5
3 NA 15 0.0
4 4.1 15 NA
5 1.5 99 NA
or, we could reduce the separate mutate
operations into a single operation by adding an across()
function to its argument.
%>% mutate(across( c(Wind, Prec),
met ~ ifelse( . == 99, NA, .)))
Wind Dir Prec
1 3.4 181 NA
2 5.0 220 0.5
3 NA 15 0.0
4 4.1 15 NA
5 1.5 99 NA
across
takes two arguments: the columns the mutate operation is to operate on, and the mutate operation to perform on these columns. Here, the tilde ~
can be interpreted as “to the listed columns, apply the following function …”. The dot .
in the ifelse
function is a placeholder for each column listed in across
’s first argument.
9.1.6 summarise
: Summarize columns
You can summarize (or “collapse”) one or more columns using the summarise
function. For instance, to get the minimum and maximum years from the Year
column, type:
summarise(dat, yr_min = min(Year), yr_max=max(Year))
yr_min yr_max
1 1961 2012
9.2 Combining data manipulation functions using the pipe %>%
In most cases, you will find yourself wanting to combine several of dplyr
’s data manipulation functions. For example,
<- filter(dat, Information == "Yield (Hg/Ha)",
dat.yield == "Oats",
Crop == 2012)
Year <- mutate(dat.yield, Country = ifelse(Country == "Canada", "CAN", "USA"))
dat.rename <- select(dat.rename, Country, Value)
dat.final
head(dat.final, 3)
Country Value
1 CAN 24954.79
2 USA 21974.70
The downside to this approach is the creation of several intermediate objects (e.g. dat.yield
and dat.rename
). This can make the workflow difficult to follow and clutter your R session with needless intermediate objects.
Another approach to combining dplyr
operations is to use the piping operator ,%>%
, which daisy chains dplyr
operations. So our previous workflow could look like:
<- dat %>%
dat.final filter(Information == "Yield (Hg/Ha)",
== "Oats",
Crop == 2012) %>%
Year mutate(Country = ifelse(Country == "Canada", "CAN", "USA")) %>%
select(Country, Value)
head(dat.final, 3)
Country Value
1 CAN 24954.79
2 USA 21974.70
The chunk of code can be read as “… with the dat
table, filter
by …, then mutate
…., then select
…” with the result from one operation being passed on to the next using the %>%
operator. Note that the filter
, mutate
and select
functions do not include the data table name making the chunk of code less cluttered and easier to read. The input data table dat
appears just once at the beginning of the pipe.
9.2.1 R has a native pipe too
R has recently (as of version 4.1
) added its own native pipe to its base function. Its infix operator is written as |>
. In most code chunks covered in these tutorials, you can substitute %>%
with |>
. For example, you can write the previous code chunk as:
<- dat |>
dat.final filter(Information == "Yield (Hg/Ha)",
== "Oats",
Crop == 2012) |>
Year mutate(Country = ifelse(Country == "Canada", "CAN", "USA")) |>
select(Country, Value)
There are, however, a few subtle differences between the two. A deeper dive in how they differ can be found here.
In this course, we’ll stick with the %>%
operator given that |>
is new and is not yet as widely adopted as %>%
.
9.2.2 Updating a factor’s levels in a dataframe
Subsetting a dataframe by row might result in a factor losing all values associated with a level. You learned in section 3.3.1.2
that the droplevels
function can be used to remove any levels no longer present in the factor. droplevels
can also be used in a piping operation, however, the function will drop levels for all factors in the dataframe–this might not be the intended goal.
If you want to selectively drop the levels for a specific set of factors in a dataframe, you might want to make use of the fct_drop
function from the forcats
package. For example, let’s first create a subset of the built-in mtcars
dataset and create factors from a couple of its variables. We’ll use the base methods to tackle this task.
<- mtcars[ , c("cyl", "gear")]
mt $cyl <- factor(mtcars$cyl)
mt$gear <- factor(mtcars$gear) mt
Now, let’s filter by cyl
.
<- mt %>%
mt_sub filter(cyl == 8)
Exploring mt_sub
’s summary, we see that some levels have no matching values in the factors:
summary(mt_sub)
cyl gear
4: 0 3:12
6: 0 4: 0
8:14 5: 2
Applying droplevels
to the dataframe will remove those levels from both factors;
<- mt %>%
mt_sub filter(cyl == 8) %>%
droplevels()
summary(mt_sub)
cyl gear
8:14 3:12
5: 2
Using fct_drop
controls which factor(s) you want to clean up the levels for. Note, however, that fct_drop
must be used inside of a mutate
function.
library(forcats)
<- mt %>%
mt_sub filter(cyl == 8) %>%
mutate( cyl = fct_drop(cyl))
summary(mt_sub)
cyl gear
8:14 3:12
4: 0
5: 2
9.3 Conditional statements
9.3.1 The base ifelse
Conditional statements are used when you want to create an output value that is conditioned on an evaluation. For example, if you want to output a value of 1
if an input value is less than 23
and a value of 0
otherwise, you can make use of the ifelse
function as follows:
<- c(12,102, 43, 20, 90, 0, 12, 6)
x ifelse(x < 23, 1, 0)
[1] 1 0 0 1 0 1 1 1
ifelse
takes three arguments:
- The condition to evaluate (
x < 23
in the above example); - The value to output if the condition is
TRUE
(1
in our example); - The value to output if the condition is
FALSE
(0
in our example).
The base ifelse
function works as expected when the input/output values are numeric or character, but it does not work as expected when applied to factors or dates. For example, if you wish to replace one factor level with another, the following example will not return the expected output.
<- as.factor( c("apple", "banana", "banana", "pear", "apple"))
x ifelse(x == "pear", "apple", x)
[1] "1" "2" "2" "apple" "1"
The output is a character representation of the level number (recall that factors encode level values as numbers behind the scenes, i.e. apple
=1, banana
=2, etc…). Likewise, if you wish to replace an erroneous date you will get:
library(lubridate)
<- mdy("1/23/2016", "3/2/2016", "12/1/1901", "11/23/2016")
y ifelse( y == mdy("12/1/1901"), mdy("12/1/2016"), y)
[1] 16823 16862 17136 17128
Here, ifelse
converts the date object to its internal numeric representation as number of days since 1970.
9.3.2 dplyr
’s if_else
ifelse
does not preserve the attributes that might be present in a vector. In other words, it will strip away the vector’s class
. A safer alternative is to use dplyr
’s if_else
function.
Reworking the above examples:
if_else( y == mdy("12/1/1901"), mdy("12/1/2016"), y)
[1] "2016-01-23" "2016-03-02" "2016-12-01" "2016-11-23"
The date class is preserved. Now let’s check the output of a factor.
if_else(x == "pear", "apple", x)
[1] "apple" "banana" "banana" "apple" "apple"
Note that when working with factors, however, if_else
will strip the factor
class of an input factor. But, instead of returning the factor’s underlying integer values, it outputs the associated levels as a character
data type.
The workaround is to convert the if_else
output to a factor.
factor(if_else(x == "pear", "apple", x))
[1] apple banana banana apple apple
Levels: apple banana
If you need to explicitly define the levels, add the levels = c("pear",...)
.
NOTE: dplyr
offers the recode
function that preserves factors however, this function is being superseded according to the documentation as of version 1.1.4
.
9.3.3 Changing values based on multiple conditions: case_when
ifelse
and if_else
work great when a single set of conditions is to be satisfied. But, if multiple sets of conditions are to be evaluated, nested if/else statements become cumbersome and are prone to clerical error. The following code highlights an example of nested if/else statements.
<- c("F","F","C", "K")
unit if_else( unit == "C", "Celsius", if_else(unit == "F", "Fahrenheit", "Kelvin"))
[1] "Fahrenheit" "Fahrenheit" "Celsius" "Kelvin"
A simpler solution is to use the case_when
function.
case_when(unit == "C" ~ "Celsius",
== "F" ~ "Fahrenheit",
unit == "K" ~ "Kelvin") unit
[1] "Fahrenheit" "Fahrenheit" "Celsius" "Kelvin"
case_when
can aso be used for more complex operations. For example, given two vectors, unit
and temp
, we would like to convert all temp
values to Fahrenheit by applying a temperature conversion dependent on the unit
value.
<- c(45.2, 56.3, 11.0, 285) temp
case_when(unit == "F" ~ temp,
== "C" ~ (temp * 9/5) + 32,
unit TRUE ~ (temp - 273.15) * 9/5 + 32)
[1] 45.20 56.30 51.80 53.33
The last argument, TRUE ~
, applies to all conditions not satisfied by the previous two conditions (otherwise, not doing so would return NA
values by default). You only need to add a TRUE ~
condition if you know that all previously listed conditions may not cover all possible outcomes. Here, we know that some observations are associated with unit == "K"
yet that condition is not explicitly defined in the case_when
arguments. We could have, of course, added the unit == "K"
condition to the above code chunk thus alleviating the need for the TRUE ~
condition.
Note that the order in which these conditions are listed matters since evaluation stops at the first TRUE
outcome encountered. So, had the last condition been moved to the top of the stack, all temp
values would be assigned the first conversion option.
# What not to do ...
case_when(TRUE ~ (temp - 273.15) * 9/5 + 32,
== "F" ~ temp,
unit == "C" ~ (temp * 9/5) + 32) unit
[1] -378.31 -358.33 -439.87 53.33
Note that case_when
can also be used inside of a mutate
function. For example, to replace Canada
and United States of America
in variable Country
with CAN
and USA
respectively and to create a new variable called Type
which will take on the values of 1
, 2
or 3
depending on the values in variable Source
, type the following:
<- dat %>%
dat1 mutate(Country = case_when(Country == "Canada" ~ "CAN",
== "United States of America" ~ "USA"),
Country Type = case_when(Source == "Calculated data" ~ 1,
== "Official data" ~ 2,
Source TRUE ~ 3))
head(dat1)
Country Crop Information Year Value Source Type
1 CAN Barley Area harvested (Ha) 2012 2060000.00 Official data 2
2 CAN Barley Yield (Hg/Ha) 2012 38894.66 Calculated data 1
3 CAN Buckwheat Area harvested (Ha) 2012 0.00 FAO estimate 3
4 CAN Canary seed Area harvested (Ha) 2012 101900.00 Official data 2
5 CAN Canary seed Yield (Hg/Ha) 2012 12161.92 Calculated data 1
6 CAN Grain, mixed Area harvested (Ha) 2012 57900.00 Official data 2
9.4 Replacing values with NA
So far, we’ve used the ifelse
or if_else
functions to replace certain values with NA
. dplyr
offers the na_if()
function to simplify the syntax. For example, to replace -999
with NA
:
<- c(-999, 6, -1, -999)
val na_if( val , -999 )
[1] NA 6 -1 NA
Likewise, to replace all empty character values:
<- c("ab", "", "A b", " ")
val na_if( val , "" )
[1] "ab" NA "A b" " "
na_if
will also preserve the object’s class. For example:
<- as.factor( c("apple", "walnut", "banana", "pear", "apple"))
x na_if(x , "walnut")
[1] apple <NA> banana pear apple
Levels: apple banana pear walnut
But, note that it does not automatically drop the level being replaced with NA
.
na_if
also works with dates, but don’t forget to evaluate a date object with a date value. For example, to replace dates of 12/1/1901
with NA
, we need to make a date object of that value. Here, we’ll make use of the mdy()
function as in mdy("12/1/1901")
.
<- mdy("1/23/2016", "3/2/2016", "12/1/1901", "11/23/2016")
y na_if(y, mdy("12/1/1901"))
[1] "2016-01-23" "2016-03-02" NA "2016-11-23"
To use na_if()
in a piping operation, it needs to be embedded in a mutate()
function. For example, to replace "Calculated data"
with NA
in the dat
dataframe, type:
<- dat %>%
dat1 mutate(Source = na_if(Source, "Calculated data" ))
unique(dat1$Source)
[1] "Official data"
[2] NA
[3] "FAO estimate"
[4] "FAO data based on imputation methodology"
9.5 Outputting a vector instead of a table using pull
Piping operations will output a table, even if a single value is returned. For example, the following summarization operation returns the total oats yield as a dataframe:
<- dat %>%
oats filter(Crop == "Oats",
== "Yield (Hg/Ha)") %>%
Information summarise(Oats_sum = sum(Value))
oats
Oats_sum
1 2169334
class(oats)
[1] "data.frame"
There may be times when you want to output a vector element and not a dataframe. To output a vector, use the pull()
function.
<- dat %>%
oats filter(Crop == "Oats",
== "Yield (Hg/Ha)") %>%
Information summarise(Oats_sum = sum(Value)) %>%
pull()
oats
[1] 2169334
class(oats)
[1] "numeric"
The pull
function can also be used to explicitly define the column to extract. For example, to extract the Value
column type:
# This outputs a multi-element vector
<- dat %>%
yield filter(Crop == "Oats",
== "Yield (Hg/Ha)") %>%
Information pull(Value)
head(yield)
[1] 24954.79 21974.70 29109.36 20492.37 27364.53 23056.62
class(yield)
[1] "numeric"