This tutorial makes use of the following R package(s): dplyr.

The data file FAO_grains_NA.csv will be used in this exercise. This dataset consists of grain yield and harvest year by North American country. The dataset was downloaded from http://faostat3.fao.org/ in June of 2014.

Run the following line to load the FAO data file into your current R session.

dat <- read.csv("http://mgimond.github.io/ES218/Data/FAO_grains_NA.csv", header=TRUE)

Make sure to load the dplyr package before proceeding with the following examples.

library(dplyr)

Splitting data operations by group: group_by

The group_by function will split any operations applied to the dataframe into groups defined by one or more columns. For example, if we wanted to get the minimum and maximum years from the Year column for which crop data are available by crop type, we would type the following:

dat %>% 
  group_by(Crop) %>% 
  summarise(yr_min = min(Year), yr_max=max(Year))
# A tibble: 11 x 3
           Crop yr_min yr_max
         <fctr>  <dbl>  <dbl>
 1       Barley   1961   2012
 2    Buckwheat   1961   2012
 3  Canary seed   1980   2012
 4 Grain, mixed   1961   2012
 5        Maize   1961   2012
 6       Millet   1961   2012
 7         Oats   1961   2012
 8      Popcorn   1961   1982
 9          Rye   1961   2012
10      Sorghum   1961   2012
11    Triticale   1989   2012

Count the number of records in each group

In this example, we are identifying the number of records by Crop type. There are two ways this can be accomplished:

dat %>%
  filter(Information == "Yield (Hg/Ha)", 
         Year >= 2005 & Year <=2010, 
         Country=="United States of America") %>%
  group_by(Crop) %>%
  count()

Or,

dat %>%
  filter(Information == "Yield (Hg/Ha)", 
         Year >= 2005 & Year <=2010, 
         Country=="United States of America") %>%
  group_by(Crop) %>%
  summarise(Count = n())
# A tibble: 7 x 2
       Crop Count
     <fctr> <int>
1    Barley     6
2 Buckwheat     6
3     Maize     6
4    Millet     6
5      Oats     6
6       Rye     6
7   Sorghum     6

The former uses the count() function and the latter uses the summarise() and n() functions.

Summarize by mean yield and year range

Here’s another example where two variables are summarized in a single pipe.

dat.grp <- dat %>%
  filter(Information == "Yield (Hg/Ha)", 
         Year >= 2005 & Year <=2010, 
         Country=="United States of America") %>%
  group_by(Crop) %>%
  summarise( Yield = mean(Value), `Number of Years` = max(Year) - min(Year)) 

dat.grp
# A tibble: 7 x 3
  Crop       Yield `Number of Years`
  <fct>      <dbl>             <dbl>
1 Barley    35471.                 5
2 Buckwheat 10418.                 5
3 Maize     96151.                 5
4 Millet    16548.                 5
5 Oats      22619.                 5
6 Rye       17132.                 5
7 Sorghum   42258.                 5

Normalizing each value in a group by the group median

In this example, we are subtracting each value in a group by that group’s median. This can be useful in identifying which year yields are higher than or lower than the median yield value within each crop group. We will concern ourselves with US yields only and sort the output by crop type. We’ll save the output dataframe as dat2.

dat2 <- dat %>% 
  filter(Information == "Yield (Hg/Ha)",
         Country == "United States of America") %>%
  select(Crop, Year, Value)                     %>%
  group_by(Crop)                                %>%
  mutate(NormYield = Value - median(Value))     %>%
  arrange(Crop)

Let’s plot the normalized yields by year for Barley and add a 0 line representing the (normalized) central value.

plot( NormYield ~ Year, dat2[dat2$Crop == "Barley",] )
abline(h = 0, col="red")

The relative distribution of points does not change, but the values do (they are re-scaled) allowing us to compare values based on some localized (group) context. This technique will prove very useful later on in the course when EDA topics are explored.

dplyr’s output data structure

Some of dplyr’s functions such as group_by/summarise generate a tibble data table. For example, the dat.grp object created in the last chunk of code is associated with a tb_df (a tibble).

class(dat.grp)
[1] "tbl_df"     "tbl"        "data.frame"

A tibble table will behave a little differently than a data frame table when printing to a screen or subsetting its elements. In most cases, a tibble rendering of the table will not pose a problem in a workflow, however, this format may prove problematic with some older functions. To remedy this, you can force the dat.grp object to a standalone dataframe as follows:

dat.df <- as.data.frame(dat.grp)
class(dat.df)
[1] "data.frame"