Colby R User Group--Manny Gimond--8.30.2017

The Tidyverse

  • What is the Tidyverse?
  • Base R vs. Tidyverse examples
  • Teaching the Tidyverse (or not)

Packages fall into two groups

  • Those that add to the base R functionality (this accounts for most packages)
  • Those that augment or usurp base functionality (ggplot2 is a good example)

There are more than 10,000 packages on CRAN!
(This does not include those available via the Bioconductor repository)

What's a Tidyverse package?

“[It's] an opinionated collection of R packages designed for data science.”

https://www.tidyverse.org/

The core set of packages includes:

  • tibble: data table format
  • ggplot2: data visualisation
  • dplyr: data manipulation
  • tidyr: data tidying
  • readr: data import
  • purrr: for-loop replacement (functional programming)

Loading tidyverse packages

A single line of code,

library(tidyverse)

or,

library(tibble)
library(ggplot2)
library(dplyr)
library(tidyr)
library(readr)
library(purrr)

Tibble vs. data frame

Data frame

class(mtcars)
[1] "data.frame"

Tibble

mtcars.t <- as_tibble(mtcars)
class(mtcars.t)
[1] "tbl_df"     "tbl"        "data.frame"

Tibble vs. data frame: cleaner output

Data frame

mtcars
                     mpg cyl  disp  hp drat    wt  qsec vs
Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  0
Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02  0
Datsun 710          22.8   4 108.0  93 3.85 2.320 18.61  1
Hornet 4 Drive      21.4   6 258.0 110 3.08 3.215 19.44  1
Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0
Valiant             18.1   6 225.0 105 2.76 3.460 20.22  1
Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0
Merc 240D           24.4   4 146.7  62 3.69 3.190 20.00  1
Merc 230            22.8   4 140.8  95 3.92 3.150 22.90  1
Merc 280            19.2   6 167.6 123 3.92 3.440 18.30  1
Merc 280C           17.8   6 167.6 123 3.92 3.440 18.90  1
Merc 450SE          16.4   8 275.8 180 3.07 4.070 17.40  0
Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60  0
Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00  0
Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0
Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0
Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0
Fiat 128            32.4   4  78.7  66 4.08 2.200 19.47  1
Honda Civic         30.4   4  75.7  52 4.93 1.615 18.52  1
Toyota Corolla      33.9   4  71.1  65 4.22 1.835 19.90  1
Toyota Corona       21.5   4 120.1  97 3.70 2.465 20.01  1
Dodge Challenger    15.5   8 318.0 150 2.76 3.520 16.87  0
AMC Javelin         15.2   8 304.0 150 3.15 3.435 17.30  0
Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41  0
Pontiac Firebird    19.2   8 400.0 175 3.08 3.845 17.05  0
Fiat X1-9           27.3   4  79.0  66 4.08 1.935 18.90  1
Porsche 914-2       26.0   4 120.3  91 4.43 2.140 16.70  0
Lotus Europa        30.4   4  95.1 113 3.77 1.513 16.90  1
Ford Pantera L      15.8   8 351.0 264 4.22 3.170 14.50  0
Ferrari Dino        19.7   6 145.0 175 3.62 2.770 15.50  0
Maserati Bora       15.0   8 301.0 335 3.54 3.570 14.60  0
Volvo 142E          21.4   4 121.0 109 4.11 2.780 18.60  1
                    am gear carb
Mazda RX4            1    4    4
Mazda RX4 Wag        1    4    4
Datsun 710           1    4    1
Hornet 4 Drive       0    3    1
Hornet Sportabout    0    3    2
Valiant              0    3    1
Duster 360           0    3    4
Merc 240D            0    4    2
Merc 230             0    4    2
Merc 280             0    4    4
Merc 280C            0    4    4
Merc 450SE           0    3    3
Merc 450SL           0    3    3
Merc 450SLC          0    3    3
Cadillac Fleetwood   0    3    4
Lincoln Continental  0    3    4
Chrysler Imperial    0    3    4
Fiat 128             1    4    1
Honda Civic          1    4    2
Toyota Corolla       1    4    1
Toyota Corona        0    3    1
Dodge Challenger     0    3    2
AMC Javelin          0    3    2
Camaro Z28           0    3    4
Pontiac Firebird     0    3    2
Fiat X1-9            1    4    1
Porsche 914-2        1    5    2
Lotus Europa         1    5    2
Ford Pantera L       1    5    4
Ferrari Dino         1    5    6
Maserati Bora        1    5    8
Volvo 142E           1    4    2

Tibble

mtcars.t
# A tibble: 32 x 11
     mpg   cyl  disp    hp  drat    wt  qsec    vs    am
 * <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1  21.0     6 160.0   110  3.90 2.620 16.46     0     1
 2  21.0     6 160.0   110  3.90 2.875 17.02     0     1
 3  22.8     4 108.0    93  3.85 2.320 18.61     1     1
 4  21.4     6 258.0   110  3.08 3.215 19.44     1     0
 5  18.7     8 360.0   175  3.15 3.440 17.02     0     0
 6  18.1     6 225.0   105  2.76 3.460 20.22     1     0
 7  14.3     8 360.0   245  3.21 3.570 15.84     0     0
 8  24.4     4 146.7    62  3.69 3.190 20.00     1     0
 9  22.8     4 140.8    95  3.92 3.150 22.90     1     0
10  19.2     6 167.6   123  3.92 3.440 18.30     1     0
# ... with 22 more rows, and 2 more variables: gear <dbl>,
#   carb <dbl>

Tibble vs. data frame: no partial matching

Data frame

mtcars$h
 [1] 110 110  93 110 175 105 245  62  95 123 123 180 180 180
[15] 205 215 230  66  52  65  97 150 150 245 175  66  91 113
[29] 264 175 335 109

Tibble

mtcars.t$h
NULL
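
Base R can be made stricter as well. The lines below are a minimal sketch using only base R: [[ matches column names exactly by default, and the warnPartialMatchDollar option asks R to warn when $ falls back to a partial match. The object names (hp_partial, hp_exact, warned) are made up for this illustration.

```r
# `$` on a data frame partially matches column names: "h" resolves to "hp"
hp_partial <- mtcars$h

# `[[` matches exactly by default, so an abbreviated name returns NULL
hp_exact <- mtcars[["h"]]

# Ask R to warn whenever `$` uses partial matching, and capture that warning
old_opts <- options(warnPartialMatchDollar = TRUE)
warned <- tryCatch({ mtcars$h; FALSE }, warning = function(w) TRUE)
options(old_opts)  # restore the previous setting
```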

Tibble vs. data frame: tibble always returns a tibble

Data frame

mtcars[ , 1]
 [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8
[12] 16.4 17.3 15.2 10.4 10.4 14.7 32.4 30.4 33.9 21.5 15.5
[23] 15.2 13.3 19.2 27.3 26.0 30.4 15.8 19.7 15.0 21.4

… returns a vector

Tibble

mtcars.t[ , 1]
# A tibble: 32 x 1
     mpg
   <dbl>
 1  21.0
 2  21.0
 3  22.8
 4  21.4
 5  18.7
 6  18.1
 7  14.3
 8  24.4
 9  22.8
10  19.2
# ... with 22 more rows

… returns a one-column tibble

Tibble vs. data frame: tibble always returns a tibble

Data frame

hist(mtcars[ , 1])

[Plot: histogram of mtcars[ , 1]]

Tibble

hist(mtcars.t[ , 1])
Error in hist.default(mtcars.t[, 1]): 'x' must be numeric

hist() expects a numeric vector, not a one-column tibble
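
One way around the error is to extract the column itself rather than a one-column table. The sketch below assumes the tibble package is installed; cars_tbl and the other object names are illustrative only.

```r
library(tibble)

cars_tbl <- as_tibble(mtcars)

# `[[` and `$` extract the column itself, so the result is a numeric vector
mpg_by_position <- cars_tbl[[1]]
mpg_by_name <- cars_tbl$mpg

# Either form can be passed to a function that expects a vector
h <- hist(mpg_by_position, plot = FALSE)

# Conversely, drop = FALSE makes a base data frame return a one-column data frame
mpg_df <- mtcars[ , 1, drop = FALSE]
```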

Tibble vs. data frame: lazy evaluation

Data frame

data.frame(x = 1:5, 
           y = 11:15, 
           z = sqrt(x^2+y^2) )
Error in data.frame(x = 1:5, y = 11:15, z = sqrt(x^2 + y^2)): object 'x' not found

Tibble

tibble(x = 1:5, 
       y = 11:15, 
       z = sqrt(x^2+y^2) )
# A tibble: 5 x 3
      x     y        z
  <int> <int>    <dbl>
1     1    11 11.04536
2     2    12 12.16553
3     3    13 13.34166
4     4    14 14.56022
5     5    15 15.81139
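
tibble() evaluates its arguments in order, which is why z can refer to x and y. A rough base R equivalent, shown here only as a sketch, builds the data frame first and then derives the extra column with transform(). The name xyz_df is made up for this example.

```r
# Build the data frame first, then derive z from the columns that now exist
xyz_df <- transform(data.frame(x = 1:5, y = 11:15),
                    z = sqrt(x^2 + y^2))
```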

readr package: read.csv vs read_csv

read.csv

df <- read.csv("FAO_grains_NA.csv")
class(df)
[1] "data.frame"

read_csv

library(readr)
tb <- read_csv("FAO_grains_NA.csv")
class(tb)
[1] "tbl_df"     "tbl"        "data.frame"

Compared with read.csv, read_csv is:

  • faster
  • consistent in behaviour across platforms

readr package: read.csv vs read_csv

read.csv

summary(df)
                     Country              Crop    
 Canada                  :730   Barley      :208  
 United States of America:771   Maize       :208  
                                Oats        :208  
                                Rye         :208  
                                Buckwheat   :200  
                                Grain, mixed:104  
                                (Other)     :365  
              Information       Year     
 Area harvested (Ha):752   Min.   :1961  
 Yield (Hg/Ha)      :749   1st Qu.:1974  
                           Median :1987  
                           Mean   :1987  
                           3rd Qu.:2000  
                           Max.   :2012  

     Value         
 Min.   :       0  
 1st Qu.:   19551  
 Median :   47131  
 Mean   : 1622720  
 3rd Qu.:  558070  
 Max.   :35400000  

                                      Source   
 Calculated data                         :749  
 FAO data based on imputation methodology: 17  
 FAO estimate                            : 73  
 Official data                           :662  



read_csv

summary(tb)
   Country              Crop           Information       
 Length:1501        Length:1501        Length:1501       
 Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character  



      Year          Value             Source         
 Min.   :1961   Min.   :       0   Length:1501       
 1st Qu.:1974   1st Qu.:   19551   Class :character  
 Median :1987   Median :   47131   Mode  :character  
 Mean   :1987   Mean   : 1622720                     
 3rd Qu.:2000   3rd Qu.:  558070                     
 Max.   :2012   Max.   :35400000                     
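
The factor summaries from read.csv above depend on its stringsAsFactors default, which changed to FALSE in R 4.0.0, so a newer R session will show character columns for both readers. The sketch below makes the choice explicit in each function. It writes a tiny made-up CSV to a temporary file (not the FAO data set) and assumes the readr package is installed.

```r
library(readr)

# A small, made-up CSV written to a temporary file for illustration
csv_path <- tempfile(fileext = ".csv")
writeLines(c("Country,Year,Value",
             "Canada,1961,10",
             "United States of America,1962,20"),
           csv_path)

# read.csv: state explicitly whether strings become factors
df_chr <- read.csv(csv_path, stringsAsFactors = FALSE)
df_fct <- read.csv(csv_path, stringsAsFactors = TRUE)

# read_csv: strings stay character unless a factor column is requested
tb_fct <- read_csv(csv_path,
                   col_types = cols(Country = col_factor(levels = NULL),
                                    Year = col_integer(),
                                    Value = col_double()))
```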

Data manipulation: Filtering by row

base R

subset(mtcars, hp > 200)
                     mpg cyl disp  hp drat    wt  qsec vs
Duster 360          14.3   8  360 245 3.21 3.570 15.84  0
Cadillac Fleetwood  10.4   8  472 205 2.93 5.250 17.98  0
Lincoln Continental 10.4   8  460 215 3.00 5.424 17.82  0
Chrysler Imperial   14.7   8  440 230 3.23 5.345 17.42  0
Camaro Z28          13.3   8  350 245 3.73 3.840 15.41  0
Ford Pantera L      15.8   8  351 264 4.22 3.170 14.50  0
Maserati Bora       15.0   8  301 335 3.54 3.570 14.60  0
                    am gear carb
Duster 360           0    3    4
Cadillac Fleetwood   0    3    4
Lincoln Continental  0    3    4
Chrysler Imperial    0    3    4
Camaro Z28           0    3    4
Ford Pantera L       1    5    4
Maserati Bora        1    5    8

WARNING: subset() relies on non-standard evaluation. Its help page recommends it for interactive use only; use [ when programming.

dplyr

filter(mtcars, hp > 200)
   mpg cyl disp  hp drat    wt  qsec vs am gear carb
1 14.3   8  360 245 3.21 3.570 15.84  0  0    3    4
2 10.4   8  472 205 2.93 5.250 17.98  0  0    3    4
3 10.4   8  460 215 3.00 5.424 17.82  0  0    3    4
4 14.7   8  440 230 3.23 5.345 17.42  0  0    3    4
5 13.3   8  350 245 3.73 3.840 15.41  0  0    3    4
6 15.8   8  351 264 4.22 3.170 14.50  0  1    5    4
7 15.0   8  301 335 3.54 3.570 14.60  0  1    5    8
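
In the filter() output above the car names are gone, because they were stored as row names rather than as a column. Whether row names survive may depend on the dplyr version, so a safer pattern is to move them into a regular column first. This is a sketch assuming dplyr and tibble are installed; the column name "model" is an arbitrary choice.

```r
library(dplyr)
library(tibble)

# Base R: logical indexing with `[` keeps the row names
base_rows <- mtcars[mtcars$hp > 200, ]

# Tidyverse: store the row names in an ordinary column before filtering
cars_named <- rownames_to_column(mtcars, var = "model")
fast_cars <- filter(cars_named, hp > 200)
```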

Data manipulation: Select by column

base R

mtcars[ , c("disp","hp")]
                     disp  hp
Mazda RX4           160.0 110
Mazda RX4 Wag       160.0 110
Datsun 710          108.0  93
Hornet 4 Drive      258.0 110
Hornet Sportabout   360.0 175
Valiant             225.0 105
Duster 360          360.0 245
Merc 240D           146.7  62
Merc 230            140.8  95
Merc 280            167.6 123
Merc 280C           167.6 123
Merc 450SE          275.8 180
Merc 450SL          275.8 180
Merc 450SLC         275.8 180
Cadillac Fleetwood  472.0 205
Lincoln Continental 460.0 215
Chrysler Imperial   440.0 230
Fiat 128             78.7  66
Honda Civic          75.7  52
Toyota Corolla       71.1  65
Toyota Corona       120.1  97
Dodge Challenger    318.0 150
AMC Javelin         304.0 150
Camaro Z28          350.0 245
Pontiac Firebird    400.0 175
Fiat X1-9            79.0  66
Porsche 914-2       120.3  91
Lotus Europa         95.1 113
Ford Pantera L      351.0 264
Ferrari Dino        145.0 175
Maserati Bora       301.0 335
Volvo 142E          121.0 109

dplyr

select(mtcars, disp, hp)
                     disp  hp
Mazda RX4           160.0 110
Mazda RX4 Wag       160.0 110
Datsun 710          108.0  93
Hornet 4 Drive      258.0 110
Hornet Sportabout   360.0 175
Valiant             225.0 105
Duster 360          360.0 245
Merc 240D           146.7  62
Merc 230            140.8  95
Merc 280            167.6 123
Merc 280C           167.6 123
Merc 450SE          275.8 180
Merc 450SL          275.8 180
Merc 450SLC         275.8 180
Cadillac Fleetwood  472.0 205
Lincoln Continental 460.0 215
Chrysler Imperial   440.0 230
Fiat 128             78.7  66
Honda Civic          75.7  52
Toyota Corolla       71.1  65
Toyota Corona       120.1  97
Dodge Challenger    318.0 150
AMC Javelin         304.0 150
Camaro Z28          350.0 245
Pontiac Firebird    400.0 175
Fiat X1-9            79.0  66
Porsche 914-2       120.3  91
Lotus Europa         95.1 113
Ford Pantera L      351.0 264
Ferrari Dino        145.0 175
Maserati Bora       301.0 335
Volvo 142E          121.0 109

Data manipulation: add/compute column

base R

mtcars$ratio <- mtcars$wt / mtcars$hp
head(mtcars,3)
               mpg cyl disp  hp drat    wt  qsec vs am gear
Mazda RX4     21.0   6  160 110 3.90 2.620 16.46  0  1    4
Mazda RX4 Wag 21.0   6  160 110 3.90 2.875 17.02  0  1    4
Datsun 710    22.8   4  108  93 3.85 2.320 18.61  1  1    4
              carb      ratio
Mazda RX4        4 0.02381818
Mazda RX4 Wag    4 0.02613636
Datsun 710       1 0.02494624

With $ assignment, each added or computed column requires its own line of code.

dplyr

mtcars <- mutate(mtcars, ratio = wt / hp)
head(mtcars,3)
   mpg cyl disp  hp drat    wt  qsec vs am gear carb
1 21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
2 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
3 22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
       ratio
1 0.02381818
2 0.02613636
3 0.02494624
mtcars <- mutate(mtcars, ratio = wt / hp, 
                         ratio2 = hp / disp)
head(mtcars,3)
   mpg cyl disp  hp drat    wt  qsec vs am gear carb
1 21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
2 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
3 22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
       ratio    ratio2
1 0.02381818 0.6875000
2 0.02613636 0.6875000
3 0.02494624 0.8611111
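
A single mutate() call can also build on a column created earlier in the same call. The sketch below contrasts that with the $ approach; ratio_pct and the object names are hypothetical additions for illustration.

```r
library(dplyr)

# Later expressions in one mutate() call can use columns created earlier in it
cars_dplyr <- mutate(mtcars,
                     ratio = wt / hp,
                     ratio_pct = ratio * 100)

# The base R equivalent with `$` takes one assignment per column
cars_base <- mtcars
cars_base$ratio <- cars_base$wt / cars_base$hp
cars_base$ratio_pct <- cars_base$ratio * 100
```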

Data manipulation: summarise data by group

base R

aggregate(mtcars$mpg, by=list(mtcars$cyl), FUN=mean)
  Group.1        x
1       4 26.66364
2       6 19.74286
3       8 15.10000

With this call form, each summarised variable requires its own line of code.

dplyr

group_by(mtcars, cyl) %>% summarise(mean(mpg))
# A tibble: 3 x 2
    cyl `mean(mpg)`
  <dbl>       <dbl>
1     4    26.66364
2     6    19.74286
3     8    15.10000
group_by(mtcars, cyl) %>% summarise(mean(mpg), mean(hp))
# A tibble: 3 x 3
    cyl `mean(mpg)` `mean(hp)`
  <dbl>       <dbl>      <dbl>
1     4    26.66364   82.63636
2     6    19.74286  122.28571
3     8    15.10000  209.21429
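
Naming the summaries avoids the back-ticked column names shown above. For comparison, the formula interface of aggregate() can summarise several variables at once, as long as they share one function. This is a sketch assuming dplyr is installed; mean_mpg and mean_hp are arbitrary names.

```r
library(dplyr)

# Named summaries give clean column names in the result
by_cyl <- mtcars %>%
  group_by(cyl) %>%
  summarise(mean_mpg = mean(mpg), mean_hp = mean(hp))

# Base R: the formula interface handles several variables with one function
base_by_cyl <- aggregate(cbind(mpg, hp) ~ cyl, data = mtcars, FUN = mean)
```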

Data manipulation: ifelse vs if_else

base R

library(lubridate)
y <- mdy("1/23/2016", "12/1/1901", "11/23/2016")
ifelse( year(y) != 2016, mdy(NA), y)
[1] 16823    NA 17128

ifelse() drops the Date class and returns the underlying numbers; only plain types such as numeric and character come through unchanged

dplyr

library(lubridate)
y <- mdy("1/23/2016", "12/1/1901", "11/23/2016")
if_else( year(y) != 2016, mdy(NA), y)
[1] "2016-01-23" NA           "2016-11-23"
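
When dplyr is not available, assigning NA through a logical index also keeps the Date class. This is a base R sketch of the same operation; it extracts the year with format() instead of lubridate, and the names not_2016 and y_clean are illustrative.

```r
y <- as.Date(c("2016-01-23", "1901-12-01", "2016-11-23"))

# Flag the dates that fall outside 2016
not_2016 <- format(y, "%Y") != "2016"

# Replace them in a copy; indexed assignment preserves the Date class
y_clean <- y
y_clean[not_2016] <- NA
```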

Data manipulation: ifelse vs recode (for factors)

base R

x <- as.factor( c("banana", "pear", "apple"))
ifelse(x == "pear", "apple", x)
[1] "2"     "apple" "1"    

ifelse() returns the integer level codes for factors, not the labels

dplyr

x <- as.factor( c( "banana", "pear", "apple"))
recode(x , "pear" = "apple")
[1] banana apple  apple 
Levels: apple banana
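
For reference, base R can relabel a factor without ifelse() by editing its levels directly. In this sketch, renaming "pear" to an existing level merges the two; fruit and fruit_merged are made-up names.

```r
fruit <- as.factor(c("banana", "pear", "apple"))

# Rename the "pear" level; because "apple" already exists, the levels merge
fruit_merged <- fruit
levels(fruit_merged)[levels(fruit_merged) == "pear"] <- "apple"
```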

Data manipulation: nested ifelse vs case_when

base R

z <- c(1, -2, 102)
ifelse( z < 0, abs(z), 
         ifelse(z > 100, z - 100, z))
[1] 1 2 2

dplyr

z <- c(1, -2, 102)
case_when( z < 0 ~ abs(z),
           z > 100 ~ z - 100,
           TRUE ~ z)
[1] 1 2 2
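
case_when() tests its conditions in order and uses the first one that is TRUE for each element, so more specific conditions should come first. The sketch below uses made-up values and labels to show how the ordering changes the result; it assumes dplyr is installed.

```r
library(dplyr)

sizes <- c(1, -2, 102, 250)

# Most specific condition first: 250 is labelled "huge"
ordered_labels <- case_when(sizes > 200 ~ "huge",
                            sizes > 100 ~ "large",
                            TRUE ~ "other")

# Broader condition first: 250 already matches sizes > 100, so "huge" is never reached
swapped_labels <- case_when(sizes > 100 ~ "large",
                            sizes > 200 ~ "huge",
                            TRUE ~ "other")
```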

When should a package be adopted?

  • It's mainstream (enough)
  • It significantly improves workflow
  • It is well supported (now and in the future)

Should we teach base R, Tidyverse or both?

Does this matter in your course?

Should we seek consistency across courses?