The data files used in this tutorial were created in an earlier exercise. Type the following command to download the objects.
This should load several data frame objects into your R session (note that not all are used in this exercise). The
dat1l dataframe is a long table version of the crop yield dataset.
  Year   Crop    Yield
1 1961 Barley 16488.52
2 1962 Barley 18839.00
3 1963 Barley 18808.27
The dat1l2 dataframe adds Country to the dat1l dataframe.
  Year   Crop Country    Yield
1 2012 Barley  Canada 38894.66
2 2012  Maize  Canada 83611.49
3 2012   Oats  Canada 24954.79
The dat1w dataframe is a wide table version of dat1l.
  Year   Barley Buckwheat    Maize     Oats      Rye
1 1961 16488.52  10886.67 39183.63 15171.26 11121.79
2 1962 18839.00  11737.50 40620.80 16224.60 12892.77
3 1963 18808.27  11995.00 42595.55 16253.04 11524.11
The dat2 dataframe is a wide table representation of income by county and by various income and educational attainment levels. The first few lines and columns are shown:
   County State B20004001 B20004002 B20004003 B20004004 B20004005
1 Autauga    al     35881     17407     30169     35327     54917
2 Baldwin    al     31439     16970     25414     31312     44940
3 Barbour    al     25201     15643     20946     24201     42629
The dat2c dataframe is a long version of dat2.
  State                 County Level Region   All     F     M
1    ak Aleutians East Borough   All   West 21953 20164 22940
2    ak Aleutians East Borough  NoHS   West 21953 19250 22885
3    ak Aleutians East Borough    HS   West 20770 19671 21192
The ggplot2 package is designed around the idea that statistical graphics can be decomposed into a formal system of grammatical rules. The ggplot2 learning curve is the steepest of all graphing environments encountered thus far, but once mastered it affords the greatest control over graphical design. For an up-to-date list of ggplot2 functions, you may want to refer to ggplot2’s website.
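A plot in ggplot2 consists of different layering components, with the three primary components being the dataset that houses the data to be plotted, the aesthetics that describe how the data are mapped to the geometric elements (coordinates, colors, and so on), and the geometric elements themselves (points, lines, bars, and so on).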
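Additional (optional) layering components include statistical elements (such as smoothing and binning), facets for conditional plots, and coordinate systems that define the plot’s shape.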
To access ggplot2 functions, you will need to load its library with library(ggplot2).
From a grammatical perspective, a scientific graph is the conversion of data to aesthetic attributes and geometric objects. This is an important concept to grasp since it underlies the construction of all graphics in ggplot2.
For example, to generate a point plot of crop yield as a function of year using the dat1l data frame, we chain three elements: the function ggplot() is passed the name of the data frame whose content will be mapped; the aes() function is given data-to-geometry mapping instructions (Year is mapped to the x-axis and Yield is mapped to the y-axis); and geom_point() is the geometry type.
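A minimal sketch of the point plot just described. The small dat1l data frame built here is a hypothetical stand-in (its values are copied from the table excerpts shown earlier) so that the example runs on its own; with the tutorial's data loaded, only the ggplot() call is needed.

```r
library(ggplot2)

# Stand-in for dat1l (assumption: two crops, three years each)
dat1l <- data.frame(
  Year  = rep(1961:1963, times = 2),
  Crop  = rep(c("Barley", "Oats"), each = 3),
  Yield = c(16488.52, 18839.00, 18808.27, 15171.26, 16224.60, 16253.04)
)

# Data frame -> ggplot(); mappings -> aes(); geometry -> geom_point()
p <- ggplot(dat1l, aes(x = Year, y = Yield)) +
  geom_point()
p
```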
If we wanted to include a third variable, crop type, in the plot, we would need to map its aesthetics: here we’ll map Crop to the color element of the geom. The color aesthetic acts as a grouping parameter whereby the groups are assigned unique colors.
If we want to plot lines instead of points, simply substitute the geometry type with the geom_line() geometry. Note that the aesthetics are still mapped in the same way, with Year mapped to the x coordinate, Yield mapped to the y coordinate and Crop mapped to the geom’s color.
Note that the parameter names x= and y= can be omitted from the syntax, since aes() takes x and y as its first two positional arguments.
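The color mapping, the switch to a line geometry and the shortened aes() call can be sketched together as follows. The dat1l data frame is again a hypothetical stand-in built from values in the table excerpts.

```r
library(ggplot2)

# Stand-in for dat1l (assumption: two crops, three years each)
dat1l <- data.frame(
  Year  = rep(1961:1963, times = 2),
  Crop  = rep(c("Barley", "Oats"), each = 3),
  Yield = c(16488.52, 18839.00, 18808.27, 15171.26, 16224.60, 16253.04)
)

# Year and Yield are matched to x and y by position;
# Crop is mapped to colour, which also groups the lines by crop
p_line <- ggplot(dat1l, aes(Year, Yield, color = Crop)) +
  geom_line()
p_line
```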
Examples of a few available geometric elements follow.
geom_line generates line geometries. We’ll use data from
dat1w to generate a simple plot of oat yield as a function of year.
Parameters such as color and linetype can be passed directly to the geom_line() function. Note the difference in how colour= is implemented in that case: it no longer maps a variable’s levels to a range of colors, as it does when called inside of the aes() function; instead, it sets the line to a single fixed color.
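The difference between setting and mapping a color can be sketched as follows. The dat1w stand-in holds only the Year and Oats columns (values from the table excerpt), and the choice of "blue" and of linetype 2 is an assumption made for illustration.

```r
library(ggplot2)

# Stand-in for dat1w (assumption: only the columns needed here)
dat1w <- data.frame(Year = 1961:1963,
                    Oats = c(15171.26, 16224.60, 16253.04))

# colour and linetype are set outside aes(), so they are fixed values
# applied to the whole line rather than mappings from a variable
p_oats <- ggplot(dat1w, aes(x = Year, y = Oats)) +
  geom_line(linetype = 2, colour = "blue")
p_oats
```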
geom_point generates point geometries. It is often used in generating scatterplots. For example, we can plot male income (variable B20004007) vs female income (variable B20004013). We modify the points’ transparency by passing the alpha=0.3 parameter to the geom_point function. Other parameters that can be passed to point geoms include pch (point symbol type) and cex (point size as a fraction).
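A sketch of a semi-transparent scatter plot. The incomes below are simulated (an assumption; they are not the tutorial's values), and only the two column names are borrowed from dat2.

```r
library(ggplot2)

# Simulated stand-in for two dat2 columns (hypothetical values)
set.seed(1)
dat2 <- data.frame(B20004013 = runif(200, 15000, 45000))
dat2$B20004007 <- dat2$B20004013 * 1.3 + rnorm(200, sd = 3000)

# alpha = 0.3 is set (not mapped), so every point is 30% opaque
p_pts <- ggplot(dat2, aes(x = B20004013, y = B20004007)) +
  geom_point(alpha = 0.3)
p_pts
```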
When a bivariate scatter plot has too many overlapping points, it may be helpful to bin the observations into regular hexagons. This provides the number of observations per bin. The binwidth argument defines the width and height of each bin in the variables’ axes units.
In the following example, a boxplot of Yield is generated for each crop type. If we want to generate a single boxplot (for example for all yields irrespective of crop type) we need to pass a dummy variable to the x aesthetic.
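Both boxplot variants can be sketched as follows, using a hypothetical stand-in for dat1l. The empty string passed to x is one possible dummy value; any constant would do.

```r
library(ggplot2)

# Stand-in for dat1l (assumption: two crops, three years each)
dat1l <- data.frame(
  Year  = rep(1961:1963, times = 2),
  Crop  = rep(c("Barley", "Oats"), each = 3),
  Yield = c(16488.52, 18839.00, 18808.27, 15171.26, 16224.60, 16253.04)
)

# One box per crop
p_by_crop <- ggplot(dat1l, aes(x = Crop, y = Yield)) +
  geom_boxplot()

# A single box: x is a constant dummy value, and its axis title is dropped
p_single <- ggplot(dat1l, aes(x = "", y = Yield)) +
  geom_boxplot() +
  xlab(NULL)
```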
A violin plot is a symmetrical version of a density plot which provides greater detail of a sample’s distribution than a boxplot.
Histograms can be plotted for single variables only (unless faceting is used) as can be noted by the absence of a y= parameter in aes(). The bin widths can be specified in terms of the value’s units. In our example, the unit is yield of oats (in Hg/Ha). So if we want to generate bin widths that cover 1000 Hg/Ha, we can set binwidth = 1000 in the histogram geom.
If you want to control the number of bins, use the parameter
bins= instead. For example, to set the number of bins to 8, modify the above code chunk as follows:
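The two ways of controlling the bins can be sketched side by side. The oat yields are simulated here (an assumption) so that there are enough values to bin.

```r
library(ggplot2)

# Simulated oat yields in Hg/Ha (hypothetical values)
set.seed(42)
dat1w <- data.frame(Oats = runif(50, 15000, 25000))

# Fixed bin width, in the units of the variable
p_width <- ggplot(dat1w, aes(x = Oats)) +
  geom_histogram(binwidth = 1000)

# Fixed number of bins instead
p_bins <- ggplot(dat1w, aes(x = Oats)) +
  geom_histogram(bins = 8)
```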
Bar plots are used to summarize the counts of a categorical value. For example, we can plot the number of counties in each state (note that each record in dat2 is assigned a county).
To sort the bars by length we need to rearrange the State factor level order based on the number of counties in each state (which is the number of times a state appears in the data frame). We’ll make use of the fct_infreq function from the forcats package to reorder the State factor levels based on frequency. If we want to reverse the order (i.e. plot from smallest number of counties to greatest), we can wrap the fct_infreq function with a level-reversing function such as forcats’ fct_rev.
The geom_bar function can also be used with count values (i.e. variable already summarized by count). First, we’ll summarize the number of counties by state using the dplyr package. This will generate a data frame with just 51 records: one for each of the 50 states and the District of Columbia.
# A tibble: 6 x 2
  State Counties
  <fct>    <int>
1 ak          28
2 al          67
3 ar          74
4 az          15
5 ca          57
6 co          62
When using summarized data, we must pass the parameter stat="identity" to the geom_bar function. We must also explicitly map both the x and y aesthetics. Note that since we are now generating bar heights from a value field and not a frequency, we will need to use another forcats ordering function called fct_reorder. This function takes three parameters: the variable to be ordered (State), the variable whose values will determine the order (Counties) and the function, .fun=, which defines the statistic used to summarize the sorting variable. Since there is just one value per state, we can use any summary statistic such as the median (the function’s default). Note that you can replace fct_reorder with the base function reorder for succinctness’ sake.
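Both bar plot workflows can be sketched on a tiny, hypothetical State column (six made-up records, not the tutorial's data). Base R's table() stands in for the dplyr summary to keep the example short.

```r
library(ggplot2)
library(forcats)

# Hypothetical records: one row per county
dat2 <- data.frame(State = factor(c("ak", "al", "al", "az", "az", "az")))

# Raw records: geom_bar() counts the rows; fct_infreq() sorts by frequency
p_raw <- ggplot(dat2, aes(x = fct_infreq(State))) +
  geom_bar()

# Pre-tabulated counts (columns State and Counties)
dat2.ct <- as.data.frame(table(State = dat2$State), responseName = "Counties")

# Summarized data: bar heights come from Counties, so stat = "identity"
p_sum <- ggplot(dat2.ct,
                aes(x = fct_reorder(State, Counties, .fun = median),
                    y = Counties)) +
  geom_bar(stat = "identity")
```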
The dot plot is an alternative way to visualize counts as a function of a categorical variable. Instead of mapping State to the x-axis, we’ll map it to the y-axis. Dot plot graphics benefit from sorting, more so than bar plots. Here, we’ll make use of the forcats::fct_reorder function (see the previous section on bar plots).
Geometries can be layered. For example, to overlay a linear regression line on the data we can add a geom_smooth layer with method = "lm". geom_smooth can also be used to fit other lines such as a loess. The confidence interval can be removed from the smooth geometry by specifying se = FALSE.
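Layering a fitted line over points can be sketched as follows. The incomes are simulated (an assumption), and formula = y ~ x is spelled out only to silence the message ggplot2 prints when it picks the formula itself.

```r
library(ggplot2)

# Simulated stand-in for two dat2 columns (hypothetical values)
set.seed(1)
dat2 <- data.frame(B20004013 = runif(200, 15000, 45000))
dat2$B20004007 <- dat2$B20004013 * 1.3 + rnorm(200, sd = 3000)

# Second layer: straight-line fit without the confidence band
p_lm <- ggplot(dat2, aes(x = B20004013, y = B20004007)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

# Same layering with a loess smoother (band kept)
p_loess <- ggplot(dat2, aes(x = B20004013, y = B20004007)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "loess", formula = y ~ x)
```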
You can plot different datasets on the same ggplot canvas. This approach usually requires that each geom be assigned its own dataset and aesthetics. In the following example, we’ll compute the High Density Region (HDR) boxplot parameters using the hdrcde package, then use these parameters to construct a geom_rect() that will be added to a geom_point element built from the parent dataset. When data and aesthetics are defined inside the geom functions, the ggplot() function is still called explicitly (with no arguments) at the start of the chain.
library(hdrcde)

# Compute the HDR boxplot
hdr1d <- hdr(dat2$B20004013, prob = c(25, 50, 99))

# Build the data frame that will store the geom_rect variables
hdr.df <- data.frame(hdr1d$hdr)
cols   <- grey(1 / (1:nrow(hdr.df)))

# Build the plot
ggplot() +
  geom_point(data = dat2, aes(x = 0, y = B20004013), pch = 16, alpha = 0.5) +
  geom_rect(data = hdr.df,
            aes(xmin = -0.2, xmax = 0.2, ymin = X1, ymax = X2),
            fill = cols, colour = "grey") +
  theme(axis.title.x = element_blank(),   # Remove x-axis
        axis.text.x  = element_blank(),
        axis.ticks.x = element_blank())
The HDR boxplot is designed to visualize multi-modality in the dataset (i.e. data having more than one peak). In the above example, the boxes encompass 25%, 50% and 99% of the data.
You can add a plot title using the ggtitle() function. Axes titles can be explicitly defined using the xlab() and ylab() functions. To remove axis labels, simply pass NULL to the functions as in xlab(NULL) and ylab(NULL).
You can customize an axis’ label elements. If you are mapping continuous values along the x and y axes, use the scale_x_continuous() and scale_y_continuous() functions. For example, you can specify where to place the tics and the accompanying labels with the breaks and labels arguments. If you want to change the label formats whereby the numbers are truncated to a thousandth of their original value, you can make use of unit_format() from the scales package. The scales package also has a comma_format() function that will add commas to large numbers.
You can rotate axes labels using the theme() function, by setting the angle argument of element_text() for the axis text. The hjust argument justifies the values horizontally. Its value ranges from 0 to 1, where 0 is completely left-justified and 1 is completely right-justified. Note that the justification is relative to the text’s orientation and not to the axis. So it may be best to first rotate the label values, then to adjust justification based on the plot’s look as needed.
If you want the label values rotated 90° you might also need to justify vertically (relative to the text’s orientation) using the vjust argument, where 0 is completely bottom-justified and 1 is completely top-justified.
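A sketch of rotated x-axis labels. The data frame is a hypothetical stand-in for dat1l, and the 45° and 90° angles are chosen only for illustration.

```r
library(ggplot2)

# Stand-in for dat1l (assumption: two crops, three years each)
dat1l <- data.frame(
  Year  = rep(1961:1963, times = 2),
  Crop  = rep(c("Barley", "Oats"), each = 3),
  Yield = c(16488.52, 18839.00, 18808.27, 15171.26, 16224.60, 16253.04)
)

base <- ggplot(dat1l, aes(x = Year, y = Yield)) + geom_point()

# 45 degree labels, right-justified along the text's own direction
p_45 <- base + theme(axis.text.x = element_text(angle = 45, hjust = 1))

# 90 degree labels usually need a vertical adjustment as well
p_90 <- base + theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))
```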
The axis range can be set using xlim() and ylim(). However, if you are calling the scale_x_continuous() or scale_y_continuous() functions, you should not also use xlim() or ylim(); instead, add the limits= argument to the aforementioned functions.
You can explicitly define the breaks with the
breaks argument. Continuing with the last example, we get:
ggplot(dat2, aes(x = B20004013, y = B20004007)) +
  geom_point(alpha = 0.3) +
  xlab("Female income ($)") +
  ylab("Male income ($)") +
  scale_x_continuous(limits = c(10000, 75000),
                     labels = scales::comma_format(),
                     breaks = c(10000, 30000, 50000, 70000)) +
  scale_y_continuous(limits = c(10000, 75000),
                     labels = scales::comma_format(),
                     breaks = c(10000, 30000, 50000, 70000))
Note that the
breaks argument can be used in conjunction with other arguments (as shown in this example), or by itself.
If you wish to apply a non-linear transformation to either axis (while preserving the untransformed axis values), add the coord_trans() function, passing the transformation name to its x= parameter. You can also transform the y-axis by specifying the parameter y=. The "log" transformation defaults to the natural log. For a log base 10, use "log10" instead. For a square root transformation, use "sqrt". For the inverse, use "reciprocal". Advanced transformations can be called via the scales package. For example, the Box-Cox transformation is available as scales::boxcox_trans(), which takes the power as its argument.
Note that any statistical geom (such as the regression line) will be applied to the un-transformed data. So a linear model may end up looking non-linear after an axis transformation:
If a linear fit is to be applied to the transformed data, a better alternative is to transform the values instead of the axes. The transformation can be done on the original data or it can be implemented in ggplot using the scale_x_continuous() and scale_y_continuous() functions. These functions will accept scales transformation parameters, e.g. scale_x_continuous(trans = scales::boxcox_trans(-0.3)). Note that the parameter breaks is not required but is used here to highlight the transformed nature of the axis.
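The difference from an axis-only transformation can be sketched with the log10 shortcut scales, scale_x_log10() and scale_y_log10(). These are swapped in for the Box-Cox call as an assumption for brevity, and the incomes are simulated. Because a scale transforms the values before the statistics are computed, the fitted line is straight in the transformed space.

```r
library(ggplot2)

# Simulated stand-in for two dat2 columns (hypothetical, strictly positive)
set.seed(1)
dat2 <- data.frame(B20004013 = runif(200, 15000, 45000))
dat2$B20004007 <- dat2$B20004013 * 1.3 + rnorm(200, sd = 3000)

# Values are log10-transformed by the scales, then the lm is fitted
p_log <- ggplot(dat2, aes(x = B20004013, y = B20004007)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "lm", formula = y ~ x) +
  scale_x_log10() +
  scale_y_log10()

# The smoother's output is expressed in the transformed (log10) units
fit_line <- layer_data(p_log, 2)
```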
You can impose an aspect ratio on your plot using the coord_equal() function. For example, to set the axes units equal (in length) to one another, set ratio = 1.
You can customize geom colors using one of two sets of color schemes: one for continuous values, the other for categorical (discrete) values.
A few examples follow.
The following chunk of code summarizes
dat2 by tallying the number of counties in each state and by computing the median county income values.
# A tibble: 6 x 3
  State Counties Income
  <fct>    <int>  <dbl>
1 ak          28 33980.
2 al          67 28946
3 ar          74 26320.
4 az          15 28799
5 ca          57 33438
6 co          62 30892.
The following chunk applies a green to red color gradient fill to each bar based on the median county incomes. Note that we are using the summarized count table (and not the original dat2 table). Recall that when plotting bars from counts that are already tabulated we must specify stat="identity" in the geom_bar function.
The following chunk applies a divergent color scheme while allowing one to specify the central value of this scheme. Note that the colors are symmetrical about the midpoint which may result in only a partial range of the full possible gradient of colors.
In the last two chunks, we filled the bars with colors (note the use of functions with the string _fill_). When assigning color to point or line symbols, use the function with the _colour_ string. For example:
In the following chunk, we assign colors manually to each level in the variable Crop. The order of the color names mirrors the order of the variable levels.
The following chunk applies a predefined discrete color scheme using one of Brewer’s preset qualitative colors,
Dark2, to each level.
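Manual and Brewer discrete color schemes can be sketched as follows. The data frame is a hypothetical two-crop stand-in for dat1l, and the color names "red" and "blue" are arbitrary choices.

```r
library(ggplot2)

# Stand-in for dat1l (assumption: two crops, three years each)
dat1l <- data.frame(
  Year  = rep(1961:1963, times = 2),
  Crop  = rep(c("Barley", "Oats"), each = 3),
  Yield = c(16488.52, 18839.00, 18808.27, 15171.26, 16224.60, 16253.04)
)

base <- ggplot(dat1l, aes(x = Year, y = Yield, colour = Crop)) + geom_line()

# One color per level, in level order (Barley, then Oats)
p_manual <- base + scale_colour_manual(values = c("red", "blue"))

# Brewer's qualitative Dark2 palette
p_brewer <- base + scale_colour_brewer(palette = "Dark2")
```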
You can also apply sequential or divergent Brewer color schemes to variables having an implied order.
Let’s assume that there is an implied order to the crop types. For example, we’ll reorder the crop types based on their median yield (this creates an ordered factor from the Crop variable). We can then use one of Brewer’s sequential color schemes.
Note that we’ve added a
guides() function to rename the legend title. This is not needed to generate the sequential colors.
To reverse the color scheme, set the direction argument to -1. You can view a list of predefined Brewer color schemes using RColorBrewer::display.brewer.all().
You can embed math symbols using
plotmath’s mathematical expressions by wrapping these expressions in an
expression() function. For example,
To view the full list of mathematical expressions, type
?plotmath at a command prompt.
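A sketch of a plotmath axis title. The label text (yield in Hg per Ha, written with a superscript) is an illustrative assumption, and the data frame is a hypothetical stand-in for dat1l.

```r
library(ggplot2)

# Stand-in for dat1l (assumption: one crop, three years)
dat1l <- data.frame(Year  = 1961:1963,
                    Yield = c(16488.52, 18839.00, 18808.27))

# In plotmath, ~ adds a space and ^ creates a superscript
y_lab <- expression(Yield ~ (Hg ~ Ha^-1))

p_math <- ggplot(dat1l, aes(x = Year, y = Yield)) +
  geom_point() +
  ylab(y_lab)
```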
Faceting (or conditioning on a variable) can be implemented in ggplot2 using the facet_wrap() function. The formula ~ Country tells ggplot to condition the plots on country. If we wanted the plots to be stacked, we would set ncol = 1.
We can also condition the plots on two variables such as Crop and Country. In essence, this stacks the categories. (Note that we will also rotate the x-axis labels to prevent overlaps).
You can wrap the facet headers using the label_wrap_gen() function as an argument value to labeller. For example, to wrap the United States of America value, we’ll specify the maximum number of characters per line using the width argument.
The last facet_wrap example generated unique combinations of the variables Crop and Country. But such plots are usually best represented in a grid structure where one variable is spread along one axis and the other variable is spread along another axis of the plot layout. This can be accomplished using the facet_grid() function.
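The faceting variants discussed so far can be sketched on a small, hypothetical stand-in for dat1l2. The yields are simulated, and the wrap width of 12 characters is an arbitrary choice.

```r
library(ggplot2)

# Hypothetical stand-in for dat1l2: 3 years x 2 crops x 2 countries
dat1l2 <- expand.grid(Year    = 2010:2012,
                      Crop    = c("Barley", "Oats"),
                      Country = c("Canada", "United States of America"))
set.seed(7)
dat1l2$Yield <- runif(nrow(dat1l2), 20000, 40000)

base <- ggplot(dat1l2, aes(x = Year, y = Yield, colour = Crop)) + geom_line()

# One panel per country, side by side
p_row <- base + facet_wrap(~ Country, nrow = 1)

# Same panels stacked vertically
p_col <- base + facet_wrap(~ Country, ncol = 1)

# Grid layout: crops along rows, countries along columns,
# with long header text wrapped at roughly 12 characters
p_grid <- base + facet_grid(Crop ~ Country,
                            labeller = label_wrap_gen(width = 12))
```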
In the above examples, we are faceting the plots based on a categorical variable:
crop. But what if we want to facet the plots based on a continuous variable? For example, we might be interested in comparing male and female incomes across different female income ranges. This requires that a new categorical field (a factor) be created assigning to each case (row) an income group. We can use the
cut() function to accomplish this task (we’ll also omit all values greater than 100,000):
  State                 County Level Region   All     F     M          incrng
1    ak Aleutians East Borough   All   West 21953 20164 22940     (0,2.5e+04]
2    ak Aleutians East Borough  NoHS   West 21953 19250 22885     (0,2.5e+04]
3    ak Aleutians East Borough    HS   West 20770 19671 21192     (0,2.5e+04]
4    ak Aleutians East Borough    AD   West 26383 26750 26352 (2.5e+04,5e+04]
5    ak Aleutians East Borough    BD   West 22431 19592 27875     (0,2.5e+04]
6    ak Aleutians East Borough  Grad   West 74000 74000 71250 (5e+04,7.5e+04]
In this chunk of code, we create a new variable,
incrng, which is assigned an income category group depending on which range
dat2c$F (female income) falls into. The income interval breaks are defined in
breaks=. In the output, you will note that the factor
incrng defines a range of incomes (e.g.
(0 , 2.5e+04]) where the parenthesis
( indicates that the left-most value is exclusive and the bracket
] indicates that the right-most value is inclusive.
However, because we did not create categories that covered all income values in dat2c$F, we ended up with a few NA’s in the incrng field:
    (0,2.5e+04] (2.5e+04,5e+04] (5e+04,7.5e+04] (7.5e+04,1e+05]            NA's
           9419            7520            1425              33               5
We will remove all rows associated with missing incrng values:
    (0,2.5e+04] (2.5e+04,5e+04] (5e+04,7.5e+04] (7.5e+04,1e+05]
           9419            7520            1425              33
We can list all unique levels in our newly created factor using the levels() function:
[1] "(0,2.5e+04]"     "(2.5e+04,5e+04]" "(5e+04,7.5e+04]" "(7.5e+04,1e+05]"
The interval labels are not easy to read as displayed (particularly when scientific notation is adopted). So we will assign more meaningful names to our factor levels as follows:
  State                 County Level Region   All     F     M    incrng
1    ak Aleutians East Borough   All   West 21953 20164 22940 Under 25k
2    ak Aleutians East Borough  NoHS   West 21953 19250 22885 Under 25k
3    ak Aleutians East Borough    HS   West 20770 19671 21192 Under 25k
4    ak Aleutians East Borough    AD   West 26383 26750 26352   25k-50k
5    ak Aleutians East Borough    BD   West 22431 19592 27875 Under 25k
6    ak Aleutians East Borough  Grad   West 74000 74000 71250   50k-75k
Note that the order in which the names are passed must match that of the original breaks.
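The binning, NA removal and renaming steps can be sketched end to end. The six F values come from the table excerpt above; the seventh (120000) is a hypothetical value added so that one record falls outside the breaks, and the label "75k-100k" is an assumed name for the last interval.

```r
# Stand-in for dat2c: female income only
dat2c <- data.frame(F = c(20164, 19250, 19671, 26750, 19592, 74000, 120000))

# Bin F into four intervals; values above 100,000 become NA
dat2c$incrng <- cut(dat2c$F, breaks = c(0, 25000, 50000, 75000, 100000))
summary(dat2c$incrng)

# Drop the rows whose income fell outside the breaks
dat2c <- na.omit(dat2c)

# Rename the levels; the order must match the order of the breaks
levels(dat2c$incrng) <- c("Under 25k", "25k-50k", "50k-75k", "75k-100k")
dat2c
```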
Now we can facet male vs. female scatter plots by income ranges. We will also throw in a best fit line to the plots.
One reason we would want to explore our data across different ranges of value is to assess the consistency in relationship between variables. In our example, this plot helps assess whether the relationship between male and female income is consistent across income groups.
We will add a 45° line using geom_abline (where the intercept will be set to 0 and the slope to 1) to help visualize the discrepancy between the batches of values. If a point lies above the 45° line, then male income is greater; if the point lies below the line, then female income is greater.
To help highlight differences in income, we will make a few changes to the faceted plots. First, we will reduce the y-axis range to $0-$150k (this will remove a few points from the data); we will force the x-axis and y-axis units to match so that a unit of $50k on the x-axis has the same length as that on the y-axis. We will also reduce the number of x tics and assign shorthand notation to income values (such as “50k” instead of “50000”). All this can be accomplished by adding the
scale_x_continuous() function to the stack of ggplot elements.
ggplot(dat2c, aes(x = F, y = M)) +
  geom_point(alpha = 0.2, pch = 20, cex = 0.8) +
  ylim(0, 150000) +
  geom_smooth(method = "lm", col = "red") +
  facet_grid(. ~ incrng) +
  coord_equal(ratio = 1) +
  geom_abline(intercept = 0, slope = 1, col = "grey50") +
  scale_x_continuous(breaks = c(50000, 100000), labels = c("50k", "100k"))
Note the change in regression slope for the last facet. The geom_smooth operation is only applied to the data limited to the axis range defined by ylim.
Now let’s look at the same data but this time conditioned on educational attainment.
# Plot M vs F by educational attainment except for Level == All
ggplot(dat2c, aes(x = F, y = M)) +
  geom_point(alpha = 0.2, pch = 20, cex = 0.8) +
  ylim(0, 150000) +
  geom_smooth(method = "lm", col = "red") +
  facet_grid(. ~ Level) +
  coord_equal(ratio = 1) +
  geom_abline(intercept = 0, slope = 1, col = "grey50") +
  scale_x_continuous(breaks = c(50000, 100000), labels = c("50k", "100k"))
We can also condition the plots on two variables: educational attainment and region.
ggplot(dat2c, aes(x = F, y = M)) +
  geom_point(alpha = 0.2, pch = 20, cex = 0.8) +
  ylim(0, 150000) +
  geom_smooth(method = "lm", col = "red") +
  facet_grid(Region ~ Level) +
  coord_equal(ratio = 1) +
  geom_abline(intercept = 0, slope = 1, col = "grey50") +
  scale_x_continuous(breaks = c(50000, 100000), labels = c("50k", "100k"))
You can create so-called heat maps by tiling the data. This typically requires the use of three variables–two of which are either categorical or have equally spaced continuous values that define a rectangular grid layout, and the third that defines the grid cells’ color. For example, a tile plot can be created showing the median income (for all sexes) as a function of education level and region.
The above example adopts a continuous color scheme. If you want to bin the color swatches using user defined breaks, swap the scale_fill_gradient function with a binned scale function such as scale_fill_binned. As of ggplot2 version 3.3, you can use the guide_coloursteps function to control the look of your legend. In the last figure, the breaks are not even, yet the legend splits the color swatches into equal length units. Setting the even.steps argument to FALSE scales the color swatches to match the true interval lengths.
The guide_coloursteps function offers additional control over the legend such as its height (barheight), width (barwidth) and the display of the minimum and maximum values (show.limits).
You can export a ggplot figure to an image using the ggsave function. The width and height arguments are defined in units of inches, in. You can also specify these parameters in units of centimeters by setting units = "cm". The device argument controls the image file type. Other file types include "jpeg", "pdf" and "svg", just to name a few.
For greater control of the font sizes, you need to make use of the theme function when building the plot.
p1 <- ggplot(dat1l2, aes(x = Year, y = Yield, color = Crop)) +
  geom_line() +
  facet_wrap(~ Country, nrow = 1) +
  scale_y_continuous(labels = scales::comma_format()) +
  theme(axis.text    = element_text(size = 8, family = "mono"),
        axis.title   = element_text(size = 11, face = "bold"),
        strip.text   = element_text(size = 11, face = "italic", family = "serif"),
        legend.title = element_text(size = 10, family = "sans"),
        legend.text  = element_text(size = 8, color = "grey40"))

ggsave("fig1.png", plot = p1, width = 6, height = 2, units = "in")
The family argument controls the font type. It does not automatically access all the fonts in your operating system. The three R fonts accessible by default are "sans", "serif" and "mono". These are usually mapped to your system’s fonts.
To access other fonts on your operating system, you will need to make use of the showtext package. The package is not covered in this tutorial; instead, refer to the package’s website for instructions on its use.