This write-up highlights differences between the base, tidyverse and data.table ecosystems as it pertains to data manipulation.

The tidyverse is itself a package that includes eight core packages however, for the purpose of this tutorial, emphasis will be placed on two of its core packages, dplyr and tidyr, which are used for most data manipulation operations. The tidyverse has its own website, https://www.tidyverse.org/, and it is mostly maintained by RStudio staff.

The data.table has its own website too, https://rdatatable.gitlab.io/data.table/. It was originally developed by Matt Dowle in 2008 and is now maintained by him and a large community of contributors. data.table is also available for Python as (pydatatable)[https://github.com/h2oai/datatable]

Before continuing, make sure to attach the following packages.

# Load the two core tidyverse packages
library(dplyr)  
library(tidyr) 

# Load the data.table package
library(data.table)

Pros and cons

Environment base tidyverse data.table
Pros
  • stable
  • available in most R installations
  • uses verbs to describe functions
  • usually faster than base
  • abundant online resources
  • concise syntax
  • very fast
  • memory efficient
  • no dependencies (other than base R)
  • Cons
  • slow
  • syntax can be clunky
  • verbose syntax, can make for lengthy scripts
  • its reliance on functional programming may not be entirely intuitive for beginners
  • lots of dependencies
  • slower than data.table, but when paired with dtplyr, can match data.table speed
  • concise syntax can be difficult to decipher on long chains of code
  • Summary of key functions

    Environment base tidyverse data.table
    Supported data class(es) data.frame data.frame, tibble data.table
    Reading data read.csv read_csv fread
    Subset by column [ , ...] select() [ ,... , ]
    Subset by rows [... , ] filter() [... , , ]
    Create new column df$y = ... mutate(tb, y = ...) [ , y := ..., ]
    Delete a column df$y = NULL select(tb, -y) [ , y := NULL, ]
    Summarize apply(df[ , y], 2, ...) summarise() [ , ...(y), ]
    Grouping aggregate() group_by() [ , , by = ...]
    Pivot to long reshape() pivot_longer() melt()
    Pivot to wide reshape() pivot_wider() dcast()
    Joining tables merge() left_join() DT1[ DT2, on = ...]

    Data prep

    Both base functions and tidyverse functions will accept a data.frame as input. tidyverse functions will also accept the tidyverse’s native tibble format. In the following exercises, a tibble will be passed to all tidyverse functions if only to distinguish a tidyverse operation from a base or data.table operation.

    A data.table, however, requires that the input table be in a native data.table format. But a data.table can be used as input into both base and tidyverse functions. In the examples that follow, data.table tables will be limited to data.table operations to distinguish such operations from those in base and tidyverse.

    mpg cyl hp am car
    21.0 6 110 1 Mazda RX4
    21.0 6 110 1 Mazda RX4 Wag
    22.8 4 93 1 Datsun 710
    21.4 6 110 0 Hornet 4 Drive
    18.7 8 175 0 Hornet Sportabout
    18.1 6 105 0 Valiant
    14.3 8 245 0 Duster 360
    24.4 4 62 0 Merc 240D
    22.8 4 95 0 Merc 230
    19.2 6 123 0 Merc 280
    17.8 6 123 0 Merc 280C
    16.4 8 180 0 Merc 450SE
    17.3 8 180 0 Merc 450SL
    15.2 8 180 0 Merc 450SLC
    10.4 8 205 0 Cadillac Fleetwood
    10.4 8 215 0 Lincoln Continental
    14.7 8 230 0 Chrysler Imperial
    32.4 4 66 1 Fiat 128
    30.4 4 52 1 Honda Civic
    33.9 4 65 1 Toyota Corolla
    21.5 4 97 0 Toyota Corona
    15.5 8 150 0 Dodge Challenger
    15.2 8 150 0 AMC Javelin
    13.3 8 245 0 Camaro Z28
    19.2 8 175 0 Pontiac Firebird
    27.3 4 66 1 Fiat X1-9
    26.0 4 91 1 Porsche 914-2
    30.4 4 113 1 Lotus Europa
    15.8 8 264 1 Ford Pantera L
    19.7 6 175 1 Ferrari Dino
    15.0 8 335 1 Maserati Bora
    21.4 4 109 1 Volvo 142E

    The data.table ecosystem

    The data.table environment benefits from its concise syntax; this eliminates the need to memorize an arsenal of function names. In fact, most data manipulating operations are performed between brackets []. However unlike the base and tidyverse environments, the data must be in a data.table format. This can be accomplished by loading the the data file using the data.table’s fread() function, or by converting the data frame to a data.table via the as.data.table() function.

    The bracket structure differs from that of the base dataframe bracket in that the data.table bracket takes three indices as opposed to two.

    The indices are referenced as follows:

    From there, you can chain other operations.

    Of interest to some, the j index can take on many base R functions including plot().

    For more information on the bracket environments, type:

    ?`[.data.table`: To access the help page for data.table’s [] environment.
    ?`[.data.frame`: To access the help page for base’s [] environment.

    A comparison of the functions

    base R operations
    tidyverse operations
    data.table operations

    Selecting column(s)

    df[ , c(“mpg”, “hp”)]
    select(tb, mpg, hp)
    dt[ , .(mpg, hp)]

    mpg hp
    21.0 110
    21.0 110
    22.8 93
    21.4 110
    18.7 175
    18.1 105
    14.3 245
    24.4 62
    22.8 95
    19.2 123
    17.8 123
    16.4 180
    17.3 180
    15.2 180
    10.4 205
    10.4 215
    14.7 230
    32.4 66
    30.4 52
    33.9 65
    21.5 97
    15.5 150
    15.2 150
    13.3 245
    19.2 175
    27.3 66
    26.0 91
    30.4 113
    15.8 264
    19.7 175
    15.0 335
    21.4 109

    NOTE: in base R, if you extract a single column, the output will be a vector. To force the output to a dataframe, add drop = FALSE as in df[ , "hp", drop = FALSE]

    Subsetting by row number

    df[1:3 , ]
    slice(tb, 1:3)
    dt[1:3, , ]

    mpg cyl hp am car
    21.0 6 110 1 Mazda RX4
    21.0 6 110 1 Mazda RX4 Wag
    22.8 4 93 1 Datsun 710

    Subsetting by variable values

    df[ df$mpg > 20, ]
    filter(tb, mpg > 20 )
    dt[ mpg > 20 , ]

    mpg cyl hp am car
    1 21.0 6 110 1 Mazda RX4
    2 21.0 6 110 1 Mazda RX4 Wag
    3 22.8 4 93 1 Datsun 710
    4 21.4 6 110 0 Hornet 4 Drive
    8 24.4 4 62 0 Merc 240D
    9 22.8 4 95 0 Merc 230
    18 32.4 4 66 1 Fiat 128
    19 30.4 4 52 1 Honda Civic
    20 33.9 4 65 1 Toyota Corolla
    21 21.5 4 97 0 Toyota Corona
    26 27.3 4 66 1 Fiat X1-9
    27 26.0 4 91 1 Porsche 914-2
    28 30.4 4 113 1 Lotus Europa
    32 21.4 4 109 1 Volvo 142E

    Sorting a table (in ascending order)

    df[order(df$mpg) , ]
    arrange(tb, mpg)
    dt[order(mpg) , ]

    mpg cyl hp am car
    15 10.4 8 205 0 Cadillac Fleetwood
    16 10.4 8 215 0 Lincoln Continental
    24 13.3 8 245 0 Camaro Z28
    7 14.3 8 245 0 Duster 360
    17 14.7 8 230 0 Chrysler Imperial
    31 15.0 8 335 1 Maserati Bora
    14 15.2 8 180 0 Merc 450SLC
    23 15.2 8 150 0 AMC Javelin
    22 15.5 8 150 0 Dodge Challenger
    29 15.8 8 264 1 Ford Pantera L
    12 16.4 8 180 0 Merc 450SE
    13 17.3 8 180 0 Merc 450SL
    11 17.8 6 123 0 Merc 280C
    6 18.1 6 105 0 Valiant
    5 18.7 8 175 0 Hornet Sportabout
    10 19.2 6 123 0 Merc 280
    25 19.2 8 175 0 Pontiac Firebird
    30 19.7 6 175 1 Ferrari Dino
    1 21.0 6 110 1 Mazda RX4
    2 21.0 6 110 1 Mazda RX4 Wag
    4 21.4 6 110 0 Hornet 4 Drive
    32 21.4 4 109 1 Volvo 142E
    21 21.5 4 97 0 Toyota Corona
    3 22.8 4 93 1 Datsun 710
    9 22.8 4 95 0 Merc 230
    8 24.4 4 62 0 Merc 240D
    27 26.0 4 91 1 Porsche 914-2
    26 27.3 4 66 1 Fiat X1-9
    19 30.4 4 52 1 Honda Civic
    28 30.4 4 113 1 Lotus Europa
    18 32.4 4 66 1 Fiat 128
    20 33.9 4 65 1 Toyota Corolla

    Sorting a table (in descending order)

    df[order(-df$mpg) , ]
    arrange(tb, -mpg)
    dt[order(-mpg) , ]

    mpg cyl hp am car
    20 33.9 4 65 1 Toyota Corolla
    18 32.4 4 66 1 Fiat 128
    19 30.4 4 52 1 Honda Civic
    28 30.4 4 113 1 Lotus Europa
    26 27.3 4 66 1 Fiat X1-9
    27 26.0 4 91 1 Porsche 914-2
    8 24.4 4 62 0 Merc 240D
    3 22.8 4 93 1 Datsun 710
    9 22.8 4 95 0 Merc 230
    21 21.5 4 97 0 Toyota Corona
    4 21.4 6 110 0 Hornet 4 Drive
    32 21.4 4 109 1 Volvo 142E
    1 21.0 6 110 1 Mazda RX4
    2 21.0 6 110 1 Mazda RX4 Wag
    30 19.7 6 175 1 Ferrari Dino
    10 19.2 6 123 0 Merc 280
    25 19.2 8 175 0 Pontiac Firebird
    5 18.7 8 175 0 Hornet Sportabout
    6 18.1 6 105 0 Valiant
    11 17.8 6 123 0 Merc 280C
    13 17.3 8 180 0 Merc 450SL
    12 16.4 8 180 0 Merc 450SE
    29 15.8 8 264 1 Ford Pantera L
    22 15.5 8 150 0 Dodge Challenger
    14 15.2 8 180 0 Merc 450SLC
    23 15.2 8 150 0 AMC Javelin
    31 15.0 8 335 1 Maserati Bora
    17 14.7 8 230 0 Chrysler Imperial
    7 14.3 8 245 0 Duster 360
    24 13.3 8 245 0 Camaro Z28
    15 10.4 8 205 0 Cadillac Fleetwood
    16 10.4 8 215 0 Lincoln Continental

    Sorting by multiple columns

    df[order(df$cyl, df$mpg) , ]
    arrange(tb, cyl, mpg)
    dt[order(cyl, mpg) , ]

    mpg cyl hp am car
    32 21.4 4 109 1 Volvo 142E
    21 21.5 4 97 0 Toyota Corona
    3 22.8 4 93 1 Datsun 710
    9 22.8 4 95 0 Merc 230
    8 24.4 4 62 0 Merc 240D
    27 26.0 4 91 1 Porsche 914-2
    26 27.3 4 66 1 Fiat X1-9
    19 30.4 4 52 1 Honda Civic
    28 30.4 4 113 1 Lotus Europa
    18 32.4 4 66 1 Fiat 128
    20 33.9 4 65 1 Toyota Corolla
    11 17.8 6 123 0 Merc 280C
    6 18.1 6 105 0 Valiant
    10 19.2 6 123 0 Merc 280
    30 19.7 6 175 1 Ferrari Dino
    1 21.0 6 110 1 Mazda RX4
    2 21.0 6 110 1 Mazda RX4 Wag
    4 21.4 6 110 0 Hornet 4 Drive
    15 10.4 8 205 0 Cadillac Fleetwood
    16 10.4 8 215 0 Lincoln Continental
    24 13.3 8 245 0 Camaro Z28
    7 14.3 8 245 0 Duster 360
    17 14.7 8 230 0 Chrysler Imperial
    31 15.0 8 335 1 Maserati Bora
    14 15.2 8 180 0 Merc 450SLC
    23 15.2 8 150 0 AMC Javelin
    22 15.5 8 150 0 Dodge Challenger
    29 15.8 8 264 1 Ford Pantera L
    12 16.4 8 180 0 Merc 450SE
    13 17.3 8 180 0 Merc 450SL
    5 18.7 8 175 0 Hornet Sportabout
    25 19.2 8 175 0 Pontiac Firebird

    Creating a new column

    df$gpm <- 1 / df$mpg
    tb <- mutate(tb, gpm = 1 / mpg)
    dt[ , gpm := 1 / mpg, ]

    mpg cyl hp am car gpm
    21.0 6 110 1 Mazda RX4 0.0476190
    21.0 6 110 1 Mazda RX4 Wag 0.0476190
    22.8 4 93 1 Datsun 710 0.0438596
    21.4 6 110 0 Hornet 4 Drive 0.0467290
    18.7 8 175 0 Hornet Sportabout 0.0534759
    18.1 6 105 0 Valiant 0.0552486
    14.3 8 245 0 Duster 360 0.0699301
    24.4 4 62 0 Merc 240D 0.0409836
    22.8 4 95 0 Merc 230 0.0438596
    19.2 6 123 0 Merc 280 0.0520833
    17.8 6 123 0 Merc 280C 0.0561798
    16.4 8 180 0 Merc 450SE 0.0609756
    17.3 8 180 0 Merc 450SL 0.0578035
    15.2 8 180 0 Merc 450SLC 0.0657895
    10.4 8 205 0 Cadillac Fleetwood 0.0961538
    10.4 8 215 0 Lincoln Continental 0.0961538
    14.7 8 230 0 Chrysler Imperial 0.0680272
    32.4 4 66 1 Fiat 128 0.0308642
    30.4 4 52 1 Honda Civic 0.0328947
    33.9 4 65 1 Toyota Corolla 0.0294985
    21.5 4 97 0 Toyota Corona 0.0465116
    15.5 8 150 0 Dodge Challenger 0.0645161
    15.2 8 150 0 AMC Javelin 0.0657895
    13.3 8 245 0 Camaro Z28 0.0751880
    19.2 8 175 0 Pontiac Firebird 0.0520833
    27.3 4 66 1 Fiat X1-9 0.0366300
    26.0 4 91 1 Porsche 914-2 0.0384615
    30.4 4 113 1 Lotus Europa 0.0328947
    15.8 8 264 1 Ford Pantera L 0.0632911
    19.7 6 175 1 Ferrari Dino 0.0507614
    15.0 8 335 1 Maserati Bora 0.0666667
    21.4 4 109 1 Volvo 142E 0.0467290

    Deleting a column

    df$gpm <- NULL
    tb <- select(tb, -gpm)
    dt[ , gpm := NULL, ]

    mpg cyl hp am car
    21.0 6 110 1 Mazda RX4
    21.0 6 110 1 Mazda RX4 Wag
    22.8 4 93 1 Datsun 710
    21.4 6 110 0 Hornet 4 Drive
    18.7 8 175 0 Hornet Sportabout
    18.1 6 105 0 Valiant
    14.3 8 245 0 Duster 360
    24.4 4 62 0 Merc 240D
    22.8 4 95 0 Merc 230
    19.2 6 123 0 Merc 280
    17.8 6 123 0 Merc 280C
    16.4 8 180 0 Merc 450SE
    17.3 8 180 0 Merc 450SL
    15.2 8 180 0 Merc 450SLC
    10.4 8 205 0 Cadillac Fleetwood
    10.4 8 215 0 Lincoln Continental
    14.7 8 230 0 Chrysler Imperial
    32.4 4 66 1 Fiat 128
    30.4 4 52 1 Honda Civic
    33.9 4 65 1 Toyota Corolla
    21.5 4 97 0 Toyota Corona
    15.5 8 150 0 Dodge Challenger
    15.2 8 150 0 AMC Javelin
    13.3 8 245 0 Camaro Z28
    19.2 8 175 0 Pontiac Firebird
    27.3 4 66 1 Fiat X1-9
    26.0 4 91 1 Porsche 914-2
    30.4 4 113 1 Lotus Europa
    15.8 8 264 1 Ford Pantera L
    19.7 6 175 1 Ferrari Dino
    15.0 8 335 1 Maserati Bora
    21.4 4 109 1 Volvo 142E

    Summarizing specific columns

    apply(df[ , c(“mpg”,“hp”)], 2, median)
    summarise(tb, mpg = median(mpg), 
                  hp = median(hp))
    

    dt[ , .(mpg = median(mpg), 
            hp = mean(hp)) , ]
    
    mpg hp
    19.2 123

    NOTE: The base method returns a vector while the tidyverse and data.table methods return a table.

    Summarizing all columns

    apply(df, 2, max)
    summarise_each(tb, max)
    

    dt[ , lapply(.SD,  max) , ]
    
    mpg cyl hp am car
    33.9 8 335 1 Volvo 142E
    • NOTE: The apply function returns a vector. As such, the data type will inheret the highest common mode (character in this example).

    Summarizing all columns by group

    aggregate(df, list(df$cyl), max)
    group_by(tb, cyl) %>% summarise_each(max)
    

    dt[ , lapply(.SD,  max), by = cyl ] 
    
    cyl mpg hp am car
    6 21.4 175 1 Valiant
    4 33.9 113 1 Volvo 142E
    8 19.2 335 1 Pontiac Firebird

    Pivoting table to long format

    df.l <- reshape(df, idvar = "car", 
       times = names(df)[names(df) != "car"],
       timevar = "variable", v.names="value",
       varying = list(names(df)
                      [names(df) != "car"]),
       direction = "long")
    

    tb.l <- pivot_longer(tb, 
                     names_to = "variable", 
                     values_to = "value", 
                     -car)
    

    dt.l <- melt(dt, id.vars = c("car"), 
                 variable.name = "variable", 
                 value.name = "value")
    
    car variable value
    Mazda RX4 mpg 21.0
    Mazda RX4 cyl 6.0
    Mazda RX4 hp 110.0
    Mazda RX4 am 1.0
    Mazda RX4 Wag mpg 21.0
    Mazda RX4 Wag cyl 6.0
    Mazda RX4 Wag hp 110.0
    Mazda RX4 Wag am 1.0
    Datsun 710 mpg 22.8
    Datsun 710 cyl 4.0
    Datsun 710 hp 93.0
    Datsun 710 am 1.0
    Hornet 4 Drive mpg 21.4
    Hornet 4 Drive cyl 6.0
    Hornet 4 Drive hp 110.0
    Hornet 4 Drive am 0.0
    Hornet Sportabout mpg 18.7
    Hornet Sportabout cyl 8.0
    Hornet Sportabout hp 175.0
    Hornet Sportabout am 0.0
    Valiant mpg 18.1
    Valiant cyl 6.0
    Valiant hp 105.0
    Valiant am 0.0
    Duster 360 mpg 14.3
    Duster 360 cyl 8.0
    Duster 360 hp 245.0
    Duster 360 am 0.0
    Merc 240D mpg 24.4
    Merc 240D cyl 4.0
    Merc 240D hp 62.0
    Merc 240D am 0.0
    Merc 230 mpg 22.8
    Merc 230 cyl 4.0
    Merc 230 hp 95.0
    Merc 230 am 0.0
    Merc 280 mpg 19.2
    Merc 280 cyl 6.0
    Merc 280 hp 123.0
    Merc 280 am 0.0
    Merc 280C mpg 17.8
    Merc 280C cyl 6.0
    Merc 280C hp 123.0
    Merc 280C am 0.0
    Merc 450SE mpg 16.4
    Merc 450SE cyl 8.0
    Merc 450SE hp 180.0
    Merc 450SE am 0.0
    Merc 450SL mpg 17.3
    Merc 450SL cyl 8.0
    Merc 450SL hp 180.0
    Merc 450SL am 0.0
    Merc 450SLC mpg 15.2
    Merc 450SLC cyl 8.0
    Merc 450SLC hp 180.0
    Merc 450SLC am 0.0
    Cadillac Fleetwood mpg 10.4
    Cadillac Fleetwood cyl 8.0
    Cadillac Fleetwood hp 205.0
    Cadillac Fleetwood am 0.0
    Lincoln Continental mpg 10.4
    Lincoln Continental cyl 8.0
    Lincoln Continental hp 215.0
    Lincoln Continental am 0.0
    Chrysler Imperial mpg 14.7
    Chrysler Imperial cyl 8.0
    Chrysler Imperial hp 230.0
    Chrysler Imperial am 0.0
    Fiat 128 mpg 32.4
    Fiat 128 cyl 4.0
    Fiat 128 hp 66.0
    Fiat 128 am 1.0
    Honda Civic mpg 30.4
    Honda Civic cyl 4.0
    Honda Civic hp 52.0
    Honda Civic am 1.0
    Toyota Corolla mpg 33.9
    Toyota Corolla cyl 4.0
    Toyota Corolla hp 65.0
    Toyota Corolla am 1.0
    Toyota Corona mpg 21.5
    Toyota Corona cyl 4.0
    Toyota Corona hp 97.0
    Toyota Corona am 0.0
    Dodge Challenger mpg 15.5
    Dodge Challenger cyl 8.0
    Dodge Challenger hp 150.0
    Dodge Challenger am 0.0
    AMC Javelin mpg 15.2
    AMC Javelin cyl 8.0
    AMC Javelin hp 150.0
    AMC Javelin am 0.0
    Camaro Z28 mpg 13.3
    Camaro Z28 cyl 8.0
    Camaro Z28 hp 245.0
    Camaro Z28 am 0.0
    Pontiac Firebird mpg 19.2
    Pontiac Firebird cyl 8.0
    Pontiac Firebird hp 175.0
    Pontiac Firebird am 0.0
    Fiat X1-9 mpg 27.3
    Fiat X1-9 cyl 4.0
    Fiat X1-9 hp 66.0
    Fiat X1-9 am 1.0
    Porsche 914-2 mpg 26.0
    Porsche 914-2 cyl 4.0
    Porsche 914-2 hp 91.0
    Porsche 914-2 am 1.0
    Lotus Europa mpg 30.4
    Lotus Europa cyl 4.0
    Lotus Europa hp 113.0
    Lotus Europa am 1.0
    Ford Pantera L mpg 15.8
    Ford Pantera L cyl 8.0
    Ford Pantera L hp 264.0
    Ford Pantera L am 1.0
    Ferrari Dino mpg 19.7
    Ferrari Dino cyl 6.0
    Ferrari Dino hp 175.0
    Ferrari Dino am 1.0
    Maserati Bora mpg 15.0
    Maserati Bora cyl 8.0
    Maserati Bora hp 335.0
    Maserati Bora am 1.0
    Volvo 142E mpg 21.4
    Volvo 142E cyl 4.0
    Volvo 142E hp 109.0
    Volvo 142E am 1.0

    NOTE 1: The output order will differ between these methods.
    NOTE 2: The base reshape function creates row names from the car variable. We’ll remove the row names to match the other table’s structure.

    Pivoting table to wide format

    df.w <- reshape(df.l, 
                 idvar = "car",
                 timevar="variable", 
                 v.names="value",  
                 direction = "wide")
    

    tb.w <- pivot_wider(tb.l, 
                 names_from = variable, 
                 values_from = value)
    

    dt.w <- dcast(dt.l, car ~ variable, 
                  value.var="value")
    
    car mpg cyl hp am
    Mazda RX4 21.0 6 110 1
    Mazda RX4 Wag 21.0 6 110 1
    Datsun 710 22.8 4 93 1
    Hornet 4 Drive 21.4 6 110 0
    Hornet Sportabout 18.7 8 175 0
    Valiant 18.1 6 105 0
    Duster 360 14.3 8 245 0
    Merc 240D 24.4 4 62 0
    Merc 230 22.8 4 95 0
    Merc 280 19.2 6 123 0
    Merc 280C 17.8 6 123 0
    Merc 450SE 16.4 8 180 0
    Merc 450SL 17.3 8 180 0
    Merc 450SLC 15.2 8 180 0
    Cadillac Fleetwood 10.4 8 205 0
    Lincoln Continental 10.4 8 215 0
    Chrysler Imperial 14.7 8 230 0
    Fiat 128 32.4 4 66 1
    Honda Civic 30.4 4 52 1
    Toyota Corolla 33.9 4 65 1
    Toyota Corona 21.5 4 97 0
    Dodge Challenger 15.5 8 150 0
    AMC Javelin 15.2 8 150 0
    Camaro Z28 13.3 8 245 0
    Pontiac Firebird 19.2 8 175 0
    Fiat X1-9 27.3 4 66 1
    Porsche 914-2 26.0 4 91 1
    Lotus Europa 30.4 4 113 1
    Ford Pantera L 15.8 8 264 1
    Ferrari Dino 19.7 6 175 1
    Maserati Bora 15.0 8 335 1
    Volvo 142E 21.4 4 109 1

    Joining tables

    We’ll first create a lookup table that pairs up the am variable values with a text descriptor.

    x y
    0 automatic
    1 manual

    Next, we join the tables.

    merge(df, df.lu, by.x = "am", 
              by.y="x", all.x = TRUE)
    

    left_join(tb, tb.lu, 
              by = c("am" = "x"))
    

    dt[dt.lu, on = c("am" = "x")]
    
    am mpg cyl hp car y
    0 18.7 8 175 Hornet Sportabout automatic
    0 22.8 4 95 Merc 230 automatic
    0 21.4 6 110 Hornet 4 Drive automatic
    0 17.3 8 180 Merc 450SL automatic
    0 18.1 6 105 Valiant automatic
    0 14.3 8 245 Duster 360 automatic
    0 24.4 4 62 Merc 240D automatic
    0 14.7 8 230 Chrysler Imperial automatic
    0 19.2 6 123 Merc 280 automatic
    0 17.8 6 123 Merc 280C automatic
    0 16.4 8 180 Merc 450SE automatic
    0 21.5 4 97 Toyota Corona automatic
    0 15.2 8 180 Merc 450SLC automatic
    0 10.4 8 205 Cadillac Fleetwood automatic
    0 10.4 8 215 Lincoln Continental automatic
    0 19.2 8 175 Pontiac Firebird automatic
    0 15.5 8 150 Dodge Challenger automatic
    0 15.2 8 150 AMC Javelin automatic
    0 13.3 8 245 Camaro Z28 automatic
    1 21.0 6 110 Mazda RX4 manual
    1 21.0 6 110 Mazda RX4 Wag manual
    1 22.8 4 93 Datsun 710 manual
    1 33.9 4 65 Toyota Corolla manual
    1 15.8 8 264 Ford Pantera L manual
    1 32.4 4 66 Fiat 128 manual
    1 30.4 4 52 Honda Civic manual
    1 30.4 4 113 Lotus Europa manual
    1 27.3 4 66 Fiat X1-9 manual
    1 26.0 4 91 Porsche 914-2 manual
    1 21.4 4 109 Volvo 142E manual
    1 19.7 6 175 Ferrari Dino manual
    1 15.0 8 335 Maserati Bora manual

    Chaining commands

    Base R does not have native chaining structure like the tidyverse or data.table. Instead, it relies on a series of standalone operations. However, one technique that can be adopted to make it seem as though a series of operations are to be run in unison is to pass each intermediate steps to the . object as in . <- mean(x).

    The tidyverse adopts margeritr’s pipe, %>%, to piece operations together.

    The data.table pieces bracketed operations back to back as in [...][...].... Note that if you are to split data.table operations across multiple lines, you must split the line in the middle of an open bracket as in:

    [ ,do something>][
     do another thing][
       and so on]

    In this example, we’ll first create a new variable, gpm, then we’ll compute the median gpm value grouped by cyl before sorting the summary in descending order.

    df$gpm <- 1/df$mpg
    . <- df[ , c("cyl", "gpm")]
    . <- aggregate(., list(df$cyl), median)
    .$Group.1 <- NULL
    .[order(-.$gpm), ]
    

    tb %>% 
      mutate(gpm = 1/mpg) %>% 
      group_by(cyl) %>% 
      summarise(gpm = median(gpm)) %>% 
      arrange(-gpm)
    

    dt[ , gpm := 1/mpg, ][ 
       order(-gpm), .(gpm = median(gpm)), 
       by = cyl]
    
    cyl gpm
    3 8 0.0657895
    2 6 0.0507614
    1 4 0.0384615

    Copyleft Manuel Gimond, 2019