tidyr
v 1.0pivot_longer
optionspivot_wider
optionsNA
This tutorial will cover many of the new and existing functions available in the tidyr
package version 1.0
. It will also discuss changes in some of the more popular functions used in reshaping data tables.
Before continuing, make sure to load the
tidyr
package.
A 2014 Boston (Logan airport) flight data summary table will be used in this example. The summary displays average mean delay times (in minutes) by day of the week and quarter.
df <- data.frame( Weekday = c( "Mon", "Tues", "Wed", "Thurs", "Fri" ),
Q1 = c( 9.9 , 4.9 , 8.8 , 12.2 , 12.2 ),
Q2 = c( 5.4 , 9.7 , 11.1 , 10.2 , 8.1 ),
Q3 = c( 8.8 , 7.9 , 10.2 , 9.2 , 7.9 ),
Q4 = c( 6.9 , 5 , 9.3 , 9.7 , 5.6 ))
Weekday Q1 Q2 Q3 Q4
1 Mon 9.9 5.4 8.8 6.9
2 Tues 4.9 9.7 7.9 5.0
3 Wed 8.8 11.1 10.2 9.3
4 Thurs 12.2 10.2 9.2 9.7
5 Fri 12.2 8.1 7.9 5.6
There are three unique variables: day of week, quarter of year, and mean departure delay.
pivot_longer
To go from wide to long we use the pivot_longer
function. This replaces the gather()
function from earlier versions of tidyr
(<1.0
).
The pivot_longer
function takes at least three arguments:
cols
: list of columns that are to be collapsed. The columns can be referenced by column number or column name. Note that the column enumeration does not need to be wrapped in a c()
function.names_to
: This is the name of the new column which will combine all column names (e.g. Q1, Q2, Q3 and Q4).values_to
: This is the name of the new column which will combine all column values (e.g. average delay times) associated with each variable combination (e.g. Mon and Q1, Tues and Q1, etc…)In our example, the line of code needed to re-express the table into a long form can be written in several ways:
df.long <- pivot_longer(df, cols=2:5, names_to = "Quarter", values_to = "Delay")
# or
df.long <- pivot_longer(df, cols=-1, names_to = "Quarter", values_to = "Delay")
# or
df.long <- pivot_longer(df, cols=Q1:Q4, names_to = "Quarter", values_to = "Delay")
# A tibble: 20 x 3
Weekday Quarter Delay
<fct> <chr> <dbl>
1 Mon Q1 9.9
2 Mon Q2 5.4
3 Mon Q3 8.8
4 Mon Q4 6.9
5 Tues Q1 4.9
6 Tues Q2 9.7
7 Tues Q3 7.9
8 Tues Q4 5
9 Wed Q1 8.8
10 Wed Q2 11.1
11 Wed Q3 10.2
12 Wed Q4 9.3
13 Thurs Q1 12.2
14 Thurs Q2 10.2
15 Thurs Q3 9.2
16 Thurs Q4 9.7
17 Fri Q1 12.2
18 Fri Q2 8.1
19 Fri Q3 7.9
20 Fri Q4 5.6
pivot_wider
If a table is to be used for a visual assessment of the values, a long format may be difficult to work with. A long table can be re-expressed into a wide form by picking the two variables that will define the column names and column values.
Continuing with our earlier example, we will convert df.long
back to a wide format using the pivot_wider()
function. This replaces the spread()
function from earlier versions of tidyr
(<1.0
). The pivot_wider()
function takes at least two arguments:
names_from
: Column whose values will be converted to column names.values_from
: Column whose values will populate the table’s block of cell values.# A tibble: 5 x 5
Weekday Q1 Q2 Q3 Q4
<fct> <dbl> <dbl> <dbl> <dbl>
1 Mon 9.9 5.4 8.8 6.9
2 Tues 4.9 9.7 7.9 5
3 Wed 8.8 11.1 10.2 9.3
4 Thurs 12.2 10.2 9.2 9.7
5 Fri 12.2 8.1 7.9 5.6
pivot_longer
optionsHere’s a subset of median income by sex and by work experience for 2017.
df2 <- data.frame( state = c("Maine", "Massachusetts",
"New Hampshire", "Vermont"),
male_fulltime = c(50329,66066, 59962, 50530),
male_other = c(18099, 18574, 20274, 17709),
female_fulltime = c(40054, 53841, 46178, 42198),
female_other = c(13781, 14981, 15121, 14422))
df2
state male_fulltime male_other female_fulltime female_other
1 Maine 50329 18099 40054 13781
2 Massachusetts 66066 18574 53841 14981
3 New Hampshire 59962 20274 46178 15121
4 Vermont 50530 17709 42198 14422
At first glance, it might seem that we have three variables as in the earlier example, but upon closer examination, we see that we can tease out two variables from the column names: sex (male
and female
) and work experience (fulltime
and other
).
pivot_longer
has an argument, names_sep
, that is passed the character that is used to delimit the two variable values. In our example, this character is _
. Since the column values will be split across two variables we will also need to pass two column names to the names_to
argument.
df2.long <- pivot_longer(df2, cols=-1, names_to = c("sex","work"),
names_sep = "_", values_to = "income")
df2.long
# A tibble: 16 x 4
state sex work income
<fct> <chr> <chr> <dbl>
1 Maine male fulltime 50329
2 Maine male other 18099
3 Maine female fulltime 40054
4 Maine female other 13781
5 Massachusetts male fulltime 66066
6 Massachusetts male other 18574
7 Massachusetts female fulltime 53841
8 Massachusetts female other 14981
9 New Hampshire male fulltime 59962
10 New Hampshire male other 20274
11 New Hampshire female fulltime 46178
12 New Hampshire female other 15121
13 Vermont male fulltime 50530
14 Vermont male other 17709
15 Vermont female fulltime 42198
16 Vermont female other 14422
pivot_wider
optionsContinuing with the df2.long
dataframe, we can spread the long table back to a wide table while combining the sex
and work
variables. We’ll add the names_sep
argument which defines the character to use to separate the two variable names. We’ll use a dot .
separator in this example.
# A tibble: 4 x 5
state male.fulltime male.other female.fulltime female.other
<fct> <dbl> <dbl> <dbl> <dbl>
1 Maine 50329 18099 40054 13781
2 Massachusetts 66066 18574 53841 14981
3 New Hampshire 59962 20274 46178 15121
4 Vermont 50530 17709 42198 14422
If your long table has more than one unique combination of variables, pivot_wider()
will return a list. This differs from the spread()
’s behavior which would have returned an error.
var1 var2 val
1 a x 5
2 a x 3
3 b y 1
4 b y 4
# A tibble: 2 x 3
var1 x y
<fct> <list<dbl>> <list<dbl>>
1 a [2] [0]
2 b [0] [2]
Since the intersections of a
:x
and b
:y
each have two possible values, the function returns a list of values. Assuming that the duplicate records are not an erroneous entry, you will need to instruct the function on how to summarize the multiple values using the values_fn
argument. For example, to return the maximum value, type:
# A tibble: 2 x 3
var1 x y
<fct> <dbl> <dbl>
1 a 3 NA
2 b NA 1
You’ll note the empty cells resulting from there not being a valid combination for a
:y
and b
:x
. You can specify the missing values using the values_fill
argument. For example, to replace NA
with 0
type:
pivot_wider(df3, names_from = var2, values_from = val,
values_fn = list(val = min),
values_fill = list(val = 0))
# A tibble: 2 x 3
var1 x y
<fct> <dbl> <dbl>
1 a 3 0
2 b 0 1
The tidyr
package offers other functions not directly tied to pivoting. For example, to split a column into two or more columns based on a column’s delimited value, use the separate()
function.
# Let's first create a delimited table
df2.long <- pivot_longer(df2, cols=-1, names_to = "var1", values_to = "income")
df2.long
# A tibble: 16 x 3
state var1 income
<fct> <chr> <dbl>
1 Maine male_fulltime 50329
2 Maine male_other 18099
3 Maine female_fulltime 40054
4 Maine female_other 13781
5 Massachusetts male_fulltime 66066
6 Massachusetts male_other 18574
7 Massachusetts female_fulltime 53841
8 Massachusetts female_other 14981
9 New Hampshire male_fulltime 59962
10 New Hampshire male_other 20274
11 New Hampshire female_fulltime 46178
12 New Hampshire female_other 15121
13 Vermont male_fulltime 50530
14 Vermont male_other 17709
15 Vermont female_fulltime 42198
16 Vermont female_other 14422
# Split var1 column into two columns
df2.sep <- separate(df2.long, col = var1, sep = "_", into = c("sex", "work"))
df2.sep
# A tibble: 16 x 4
state sex work income
<fct> <chr> <chr> <dbl>
1 Maine male fulltime 50329
2 Maine male other 18099
3 Maine female fulltime 40054
4 Maine female other 13781
5 Massachusetts male fulltime 66066
6 Massachusetts male other 18574
7 Massachusetts female fulltime 53841
8 Massachusetts female other 14981
9 New Hampshire male fulltime 59962
10 New Hampshire male other 20274
11 New Hampshire female fulltime 46178
12 New Hampshire female other 15121
13 Vermont male fulltime 50530
14 Vermont male other 17709
15 Vermont female fulltime 42198
16 Vermont female other 14422
You can also split delimited values across rows.
# A tibble: 32 x 3
state var1 income
<fct> <chr> <dbl>
1 Maine male 50329
2 Maine fulltime 50329
3 Maine male 18099
4 Maine other 18099
5 Maine female 40054
6 Maine fulltime 40054
7 Maine female 13781
8 Maine other 13781
9 Massachusetts male 66066
10 Massachusetts fulltime 66066
# ... with 22 more rows
Note that the output is a tibble
even if the input is a dataframe.
You can expand rows based on a count column using the uncount()
function. This is the opposite of a group_by(...) %>% count()
operation that tallies up the observations based on a grouping variable. Here, we’ll replicate rows based on the column count
value.
var1 var2 count
1 a x 1
2 b y 3
var1 var2
1 a x
2 b y
2.1 b y
2.2 b y
If you want to add an index column that identifies the replicated rows, add an .id
argument.
var1 var2 id
1 a x 1
2 b y 1
2.1 b y 2
2.2 b y 3
The unite()
function performs the opposite task of separate
. It combines multiple columns into a single column by aggregating their values.
# A tibble: 16 x 3
state var1 income
<fct> <chr> <dbl>
1 Maine male,fulltime 50329
2 Maine male,other 18099
3 Maine female,fulltime 40054
4 Maine female,other 13781
5 Massachusetts male,fulltime 66066
6 Massachusetts male,other 18574
7 Massachusetts female,fulltime 53841
8 Massachusetts female,other 14981
9 New Hampshire male,fulltime 59962
10 New Hampshire male,other 20274
11 New Hampshire female,fulltime 46178
12 New Hampshire female,other 15121
13 Vermont male,fulltime 50530
14 Vermont male,other 17709
15 Vermont female,fulltime 42198
16 Vermont female,other 14422
You can use expand_grid
to automatically generate a table with unique combinations of a set of variable values. For example, to fill a table with a combination of student names and homework assignments, type:
df3.long <- expand_grid(
student = c("Joe", "Jane", "Kim"), # Define all unique student names
assignment = c(paste0("HW", 1:4)), # Define all unique HW assignments
value = NA)
df3.long
# A tibble: 12 x 3
student assignment value
<chr> <chr> <lgl>
1 Joe HW1 NA
2 Joe HW2 NA
3 Joe HW3 NA
4 Joe HW4 NA
5 Jane HW1 NA
6 Jane HW2 NA
7 Jane HW3 NA
8 Jane HW4 NA
9 Kim HW1 NA
10 Kim HW2 NA
11 Kim HW3 NA
12 Kim HW4 NA
We can then create a wide version of the table using pivot_wider
.
# A tibble: 3 x 5
student HW1 HW2 HW3 HW4
<chr> <lgl> <lgl> <lgl> <lgl>
1 Joe NA NA NA NA
2 Jane NA NA NA NA
3 Kim NA NA NA NA
NA
It’s not uncommon to be handed a table with incomplete combinations of observations. For example, the following table gives us yield and data source values for each combination of year and grain type. However, several combinations of year/grain are missing.
df4.long <- data.frame( Year = c(1999,1999,2000,2000,2001,2003,2003,2005),
Grain = c("Oats", "Corn","Oats", "Corn","Oats", "Oats", "Corn","Oats"),
Yield = c(23,45,24,40,20,19,41,22),
Src = c("a","a","b","c","a","a","c","a"),
stringsAsFactors = FALSE)
df4.long
Year Grain Yield Src
1 1999 Oats 23 a
2 1999 Corn 45 a
3 2000 Oats 24 b
4 2000 Corn 40 c
5 2001 Oats 20 a
6 2003 Oats 19 a
7 2003 Corn 41 c
8 2005 Oats 22 a
We are missing records for 2001 and Corn, 2003 and Corn, and data for both grains are missing for 2002 and 2004. To add rows for all missing pairs of year/grain values, use the complete
function. Here, we’ll assign 0
to missing Yield
values and NA
to the Src
values.
df.all <- complete(df4.long, Year=1999:2005, Grain= c("Oats", "Corn"),
fill = list(Yield = 0, Src = NA))
df.all
# A tibble: 14 x 4
Year Grain Yield Src
<dbl> <chr> <dbl> <chr>
1 1999 Corn 45 a
2 1999 Oats 23 a
3 2000 Corn 40 c
4 2000 Oats 24 b
5 2001 Corn 0 <NA>
6 2001 Oats 20 a
7 2002 Corn 0 <NA>
8 2002 Oats 0 <NA>
9 2003 Corn 41 c
10 2003 Oats 19 a
11 2004 Corn 0 <NA>
12 2004 Oats 0 <NA>
13 2005 Corn 0 <NA>
14 2005 Oats 22 a
If you want to show just the missing rows, use dplyr::anti_join()
.
# A tibble: 6 x 4
Year Grain Yield Src
<dbl> <chr> <dbl> <chr>
1 2001 Corn 0 <NA>
2 2002 Corn 0 <NA>
3 2002 Oats 0 <NA>
4 2004 Corn 0 <NA>
5 2004 Oats 0 <NA>
6 2005 Corn 0 <NA>
In the previous example, we had the function automatically add the missing combinations using explicitly defined ranges of values. If you just want to output the missing combinations from the existing set of values in both columns, use the expand()
function.
# A tibble: 10 x 2
Year Grain
<dbl> <chr>
1 1999 Corn
2 1999 Oats
3 2000 Corn
4 2000 Oats
5 2001 Corn
6 2001 Oats
7 2003 Corn
8 2003 Oats
9 2005 Corn
10 2005 Oats
Note that this only outputs the columns of interest. If you need to see the other columns in the output, perform a join.
# A tibble: 10 x 4
Year Grain Yield Src
<dbl> <chr> <dbl> <chr>
1 1999 Corn 45 a
2 1999 Oats 23 a
3 2000 Corn 40 c
4 2000 Oats 24 b
5 2001 Corn NA <NA>
6 2001 Oats 20 a
7 2003 Corn 41 c
8 2003 Oats 19 a
9 2005 Corn NA <NA>
10 2005 Oats 22 a
The fill()
function is used to replace NA
values with the closest non-NA
value in a column. For example, to fill down, set the .direction
argument to "down"
.
Month Year
1 1 2000
2 2 NA
3 3 NA
4 4 NA
5 5 NA
6 6 2001
7 7 NA
8 8 NA
9 9 NA
10 10 NA
11 11 NA
12 12 NA
Month Year
1 1 2000
2 2 2000
3 3 2000
4 4 2000
5 5 2000
6 6 2001
7 7 2001
8 8 2001
9 9 2001
10 10 2001
11 11 2001
12 12 2001
Manuel Gimond, 2019