While piping is common in many programming environments, it only recently found its way into R by way of Stefan Milton Bache’s magrittr package (now part of the tidyverse suite of packages). Its infix operator is written as %>%.
Take the following series of operations:
dat1 <- subset(mtcars, select = c(hp, mpg))
summary(dat1)
## hp mpg
## Min. : 52.0 Min. :10.40
## 1st Qu.: 96.5 1st Qu.:15.43
## Median :123.0 Median :19.20
## Mean :146.7 Mean :20.09
## 3rd Qu.:180.0 3rd Qu.:22.80
## Max. :335.0 Max. :33.90
The mtcars dataframe is going through two operations: a table subset, then a summary operation. This approach requires that an intermediate object be created.
A more succinct chunk would look like this:
summary( subset(mtcars, select = c(hp, mpg)))
## hp mpg
## Min. : 52.0 Min. :10.40
## 1st Qu.: 96.5 1st Qu.:15.43
## Median :123.0 Median :19.20
## Mean :146.7 Mean :20.09
## 3rd Qu.:180.0 3rd Qu.:22.80
## Max. :335.0 Max. :33.90
However, we are trading readability for succinctness.
A compromise between the two using the pipe looks like this:
library(magrittr)
mtcars %>%
  subset(select = mpg:hp) %>%
  summary()
## mpg cyl disp hp
## Min. :10.40 Min. :4.000 Min. : 71.1 Min. : 52.0
## 1st Qu.:15.43 1st Qu.:4.000 1st Qu.:120.8 1st Qu.: 96.5
## Median :19.20 Median :6.000 Median :196.3 Median :123.0
## Mean :20.09 Mean :6.188 Mean :230.7 Mean :146.7
## 3rd Qu.:22.80 3rd Qu.:8.000 3rd Qu.:326.0 3rd Qu.:180.0
## Max. :33.90 Max. :8.000 Max. :472.0 Max. :335.0
This approach avoids the need for intermediate objects while offering an easy to follow workflow.
R version 4.1 introduces the new native pipe: |>. It behaves much like %>%, at least from the user’s perspective. So, the above code chunk can be written without relying on the magrittr package as follows:
mtcars |>
  subset(select = mpg:hp) |>
  summary()
## mpg cyl disp hp
## Min. :10.40 Min. :4.000 Min. : 71.1 Min. : 52.0
## 1st Qu.:15.43 1st Qu.:4.000 1st Qu.:120.8 1st Qu.: 96.5
## Median :19.20 Median :6.000 Median :196.3 Median :123.0
## Mean :20.09 Mean :6.188 Mean :230.7 Mean :146.7
## 3rd Qu.:22.80 3rd Qu.:8.000 3rd Qu.:326.0 3rd Qu.:180.0
## Max. :33.90 Max. :8.000 Max. :472.0 Max. :335.0
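As a quick check that the three styles shown so far are interchangeable, the following sketch compares their results. It assumes R 4.1 or later (for |>); the object names res_steps, res_nested, and res_piped are illustrative and do not appear elsewhere in this tutorial.

```r
# Assumes R >= 4.1 so that the native pipe |> is available.

# Style 1: intermediate object
dat1 <- subset(mtcars, select = mpg:hp)
res_steps <- summary(dat1)

# Style 2: nested calls
res_nested <- summary(subset(mtcars, select = mpg:hp))

# Style 3: native pipe
res_piped <- mtcars |>
  subset(select = mpg:hp) |>
  summary()

# Compare the summary tables produced by each style
identical(res_steps, res_nested)
identical(res_nested, res_piped)
```

Both comparisons should return TRUE, since the pipe only changes how the code is written and not what is computed.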
RStudio offers a shortcut key for the magrittr pipe: Ctrl + Shift + M on Windows machines and Cmd + Shift + M on Macs.
RStudio does not yet offer a dedicated shortcut key for the native pipe but it does offer the option to choose which pipe to assign to that shortcut key. This option can be specified via the Options menu (note that as of this writing, this feature is only available in the preview version of RStudio).
A pipe feeds the contents (or output) from the left hand side (LHS) into the first unnamed argument of the right hand side (RHS) function. So in the following example, the pipe feeds the mtcars dataframe into the first argument of subset().
mtcars |> subset(select = mpg:hp)
The first argument in subset is the data object argument, x. Note that subset has several methods. If a dataframe is passed to subset, the method called is subset.data.frame(). We can list its arguments using the following command.
formalArgs(subset.data.frame)
## [1] "x" "subset" "select" "drop" "..."
The first argument is x (the input dataframe). So in the above piping operation, mtcars is passed as the value of the x argument of the subset function.
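One way to see where the LHS ends up is to quote the piped expression instead of running it. This is a minimal sketch assuming R 4.1 or later; the names piped_call and first_arg are illustrative.

```r
# Assumes R >= 4.1 (native pipe).
# quote() returns the parsed expression without evaluating it.
piped_call <- quote(mtcars |> subset(select = mpg:hp))
print(piped_call)

# The LHS sits in the first position of the call, which is where
# subset.data.frame() expects its x argument.
first_arg <- formalArgs(subset.data.frame)[1]
first_arg
```

The printed call shows mtcars placed ahead of select inside subset(), which is the position of the x argument.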
Knowing this can help troubleshoot unwelcome scenarios. For example, what happens if the LHS gets piped to a function on the RHS that does not have input data as its first argument?
mtcars |> lm(hp ~ mpg)
## Error in as.data.frame.default(data): cannot coerce class '"formula"' to a data.frame
lm has its data input argument, data, as its second argument. Hence, the pipe is assigning mtcars to formula, which is the first argument in the lm function.
formalArgs(lm)
## [1] "formula" "data" "subset" "weights" "na.action"
## [6] "method" "model" "x" "y" "qr"
## [11] "singular.ok" "contrasts" "offset" "..."
You’ll note that we defined the formula, hp ~ mpg, in the above code chunk; however, it’s not being explicitly assigned to the formula argument. So R is interpreting the above piping operation as:
lm(formula = mtcars, data = hp ~ mpg)
which generates an error message.
One solution is to explicitly name the formula argument to prevent the pipe from assigning mtcars to formula:
mtcars |> lm(formula = hp ~ mpg)
##
## Call:
## lm(formula = hp ~ mpg, data = mtcars)
##
## Coefficients:
## (Intercept) mpg
## 324.08 -8.83
In the above example, the formula argument is explicitly spelled out, thus forcing the pipe to look for the next argument not explicitly named in the code chunk. Once found, it assigns the LHS as that argument’s value. In the above code chunk, this next argument is data (which is what we want to pipe mtcars into). This works with both |> and %>%.
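The behavior described above can be sketched with a small helper. Note that show_args() is a hypothetical function written for this illustration only. The sketch also uses match.call() to show how R's ordinary argument matching resolves the piped lm() call once formula is named. It assumes R 4.1 or later.

```r
# Assumes R >= 4.1 (native pipe).
# Hypothetical helper (not part of any package) that reports its arguments.
show_args <- function(first, second) c(first = first, second = second)

r1 <- "LHS" |> show_args("RHS")          # LHS fills `first`
r2 <- "LHS" |> show_args(first = "RHS")  # `first` is named, so LHS fills `second`
r1
r2

# The same mechanism with lm(): the pipe inserts mtcars as the first
# positional argument; argument matching then assigns it to `data`
# because `formula` is already claimed by name.
piped <- quote(mtcars |> lm(formula = hp ~ mpg))
matched <- match.call(lm, piped)
print(matched)
```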
In some cases, naming arguments (as demonstrated in the previous example) may not be suitable. For example, the following plot function does not generate a scatter plot of hp vs mpg as we might have expected, even though we are explicitly naming the argument being assigned the hp ~ mpg formula.
mtcars |> plot(formula = hp ~ mpg)
While the above does not generate an error, it’s not generating the desired plot (i.e. a single scatter plot of hp vs. mpg).
Even though the generic plot function accepts a formula, it does not have formula as an argument:
args(plot)
## function (x, y, ...)
## NULL
So plot ignores the formula = hp ~ mpg argument in our code chunk and is, in essence, running the code plot(mtcars), which generates a scatter plot matrix for all combinations of paired variables in the data.
So why will plot accept a formula and yet not recognize the formula argument? Being a generic function, plot will pass the arguments to the plot method it thinks is needed given the argument type. Here, the plot method needed for a formula is graphics:::plot.formula. So, to make use of a named argument, you would need to modify the previous chunk by specifying the plot.formula method as follows:
mtcars |> graphics:::plot.formula(formula = hp ~ mpg)
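To see why the generic and the method differ, their argument lists can be inspected side by side. The sketch below uses getS3method() as an alternative to the ::: operator for retrieving the method; the names f and plot_formula are illustrative.

```r
f <- hp ~ mpg
class(f)  # the class that plot() dispatches on

# Retrieve the S3 method that plot() would call for a formula
plot_formula <- getS3method("plot", "formula")
names(formals(plot_formula))

# The generic itself has no `formula` argument
"formula" %in% names(formals(plot))
```

The method lists formula and data among its arguments, whereas the generic only exposes x, y and the dots.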
This approach to solving named argument roadblocks can be time consuming and lead to frustration. A few (simpler) solutions are presented next.
%>% offers the placeholder ., |> does not
One notable difference between |> and %>% is the lack of a placeholder. magrittr’s %>% offers the . placeholder, which can be used to explicitly specify where the LHS is to be placed in the RHS’s function. For example, to circumvent the missing formula argument from the generic plot function, you could place a . in the plot function where you would want the LHS to be piped into:
mtcars %>% plot(hp ~ mpg, data = .)
Note that the only argument being named is data, the argument to receive the LHS.
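As of R 4.1, the native pipe does not have a placeholder. This is to maintain its “viable syntax transformation”. (R 4.2 later added an underscore placeholder, _, which must be supplied to a named argument.)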
A solution that will work with |> (and one that also works with %>%) is to embed an anonymous function.
An anonymous function is a function that is not assigned a name. For example, the following function is a named function.
my_fun <- function(x) sqrt(x)
The above code chunk creates a function named my_fun(). Naming a function allows us to reuse this function anywhere in an R session. For example,
my_fun(20)
## [1] 4.472136
my_fun(3)
## [1] 1.732051
An anonymous function is typically used only once and is usually embedded inside other functions such as apply or its many variants. The structure of an anonymous function looks like:
(function(x) sqrt(x))()
Continuing with the plot function example, using an anonymous function to explicitly indicate where to place the LHS in the RHS function would look like:
mtcars |> (function(x) plot(hp ~ mpg, data = x))()
Here, we explicitly define the placeholder name (x in the above example). But note that you could use any other accepted name, even the . character.
Anonymous functions also work with the %>% pipe.
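The same pattern applies to any function whose data argument is not first. As a sketch, here is the earlier lm example rewritten with an anonymous function instead of a named formula argument. It assumes R 4.1 or later; the names direct, piped, and d are illustrative.

```r
# Assumes R >= 4.1 (native pipe).
# Reference model fitted without a pipe
direct <- lm(hp ~ mpg, data = mtcars)

# Same model, with the LHS routed to `data` through an anonymous function
piped <- mtcars |> (function(d) lm(hp ~ mpg, data = d))()

# Compare the fitted coefficients
all.equal(coef(direct), coef(piped))
```

Note the trailing () after the parenthesized function: the native pipe requires a function call on its right hand side, not a bare function.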
%>% and |>
Under the hood, the native pipe is distinctly different from its magrittr counterpart. %>% is a function while |> is not. This adds a small overhead to the %>% operation. |> is nothing more than a syntactic translation, which means that R will parse 10 |> sqrt() as sqrt(10). On the other hand, 10 %>% sqrt() is parsed as `%>%`(10, sqrt()), i.e. two functions are processed instead of one.
This overhead will not be noticeable to most users. But if you are running a series of piping operations in a loop, that overhead may have a measurable impact on performance. The following plot compares the performance of sqrt(10), 10 |> sqrt() and 10 %>% sqrt(). Each expression is run 10 million times.
As expected, the performance of 10 |> sqrt() is identical to that of sqrt(10) (recall that |> is a simple syntax transformation and not a function).
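The difference in how the two pipes are parsed can be inspected directly with quote(), which parses an expression without evaluating it (so magrittr does not need to be loaded for this check). This is a sketch assuming R 4.1 or later; the object names are illustrative.

```r
# Assumes R >= 4.1 (native pipe).
# The native pipe is gone by the time the parser is done.
native <- quote(10 |> sqrt())
print(native)

# The magrittr pipe survives parsing as a call to the `%>%` function,
# which must then be evaluated at run time.
magrittr_style <- quote(10 %>% sqrt())
as.character(magrittr_style[[1]])
```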
\
R 4.1 also introduces a shorthand for function: the backslash, \. The shorthand notation may help keep lines of code short when implementing an anonymous function. For example, the following two lines of code perform the exact same operation.
mtcars |> (function(x) plot(hp ~ mpg, x))()
mtcars |> (\(x) plot(hp ~ mpg, x))()
The shorthand notation can also be used with named functions:
f1 <- function(x, y) x + y
f1 <- \(x, y) x + y
However, the shorthand notation may impede readability: it’s easier to spot function than it is to spot \ when scanning for a function definition in an R script.
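A short sketch confirming that the two notations define the same function, followed by a typical use of the shorthand inside sapply. It assumes R 4.1 or later; sq_long, sq_short, and squares are illustrative names.

```r
# Assumes R >= 4.1 (backslash shorthand).
sq_long  <- function(x) x^2
sq_short <- \(x) x^2

# Same arguments and same body
identical(formals(sq_long), formals(sq_short))
identical(body(sq_long), body(sq_short))

# The shorthand is most useful for short, one-off functions
squares <- sapply(1:3, \(x) x^2)
squares
```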
The new categorical color palettes arrived with R 4.0 and carry over to R 4.1. Earlier versions of R offered the following categorical color palette:
# Before version 4.0
palette()
## [1] "black" "red" "green3" "blue" "cyan" "magenta" "yellow" "gray"
R versions 4.0 and later offer a different set of colors that do a better job of preserving perceived consistency in the lightness and saturation dimensions.
# Version 4.0 and later
palette()
## [1] "black" "#DF536B" "#61D04F" "#2297E6" "#28E2E5" "#CD0BBC" "#F5C710"
## [8] "gray62"
R 4.1 also offers additional categorical palettes, for a total of 16. The palette names can be listed via the palette.pals() function.
palette.pals()
## [1] "R3" "R4" "ggplot2" "Okabe-Ito"
## [5] "Accent" "Dark 2" "Paired" "Pastel 1"
## [9] "Pastel 2" "Set 1" "Set 2" "Set 3"
## [13] "Tableau 10" "Classic Tableau" "Polychrome 36" "Alphabet"
To view the list of colors associated with a palette (e.g. the "Accent" palette), type the following:
palette("Accent")
palette()
## [1] "#7FC97F" "#BEAED4" "#FDC086" "#FFFF99" "#386CB0" "#F0027F" "#BF5B17"
## [8] "gray40"
Note that the first line of code in the above code chunk will change the default color palette to "Accent" for the current R session.
boxplot(log(decrease) ~ treatment, data = OrchardSprays,
col = OrchardSprays$treatment)
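If changing the session-wide palette is not desirable, one option is to read a palette's colors with palette.colors(), or to save the current palette and restore it afterwards. The sketch below assumes R 4.0 or later (where palette.colors() is available); accent and old are illustrative names.

```r
# Assumes R >= 4.0.
# Read the Accent colors without touching the session palette
accent <- palette.colors(palette = "Accent")
accent

# Alternatively, switch palettes temporarily:
# palette() invisibly returns the palette that was in effect before the call
old <- palette("Accent")
# ... plotting code that relies on the Accent palette goes here ...
palette(old)  # restore the previous palette
```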
If you want to revert the palette back to the default, set the palette name to "R4".
palette("R4")
palette()
## [1] "black" "#DF536B" "#61D04F" "#2297E6" "#28E2E5" "#CD0BBC" "#F5C710"
## [8] "gray62"
boxplot(log(decrease) ~ treatment, data = OrchardSprays,
col = OrchardSprays$treatment)
If you want to replicate the default color palette used by earlier versions of R, set the palette name to "R3".
palette("R3")
palette()
## [1] "black" "red" "green3" "blue" "cyan" "magenta" "yellow"
## [8] "gray"
boxplot(log(decrease) ~ treatment, data = OrchardSprays,
col = OrchardSprays$treatment)
The palettes in R 4.1 vary in the number of color swatches. The following plot shows all colors available for each palette.
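The number of swatches per palette can also be tabulated in code. This is a minimal sketch assuming R 4.0 or later; n_colors is an illustrative name.

```r
# Assumes R >= 4.0.
# Count the colors available in each built-in categorical palette
n_colors <- sapply(palette.pals(),
                   function(p) length(palette.colors(palette = p)))
n_colors
```

The result is a named vector, so the size of a given palette can be looked up by name, e.g. n_colors["Accent"].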
Manuel Gimond, 2021