Variables can be mapped in R using packages such as ggplot2 and tmap. The focus of this tutorial will be on mapping polygon (aka areal) data. For more information on mapping spatial data (both vector and raster) visit the GIS and Spatial Analysis pages.

ggplot2 and tmap

There are at least two mapping solutions in R: ggplot2 and tmap. The bulk of this tutorial will make use of tmap but a brief example of ggplot2’s mapping capabilities will be presented next.

Generating a basic map in ggplot2

Mapping values requires two datasets: the values to be mapped and the spatial boundaries of the areas being mapped. The maps package offers R-ready maps for US states/counties and administrative boundaries for other countries such as Canada, France and Italy.

In the following example we will use ggplot2 to draw the US counties map. We will also color the counties based on the state they are located in. The state name column is labeled region and the county name column is labeled subregion.

library(ggplot2)
library(maps)
library(sf)

# Load the county data from the maps package
cnty <- st_as_sf(map("county", plot = FALSE, fill = TRUE))

# Plot the counties
ggplot(cnty) +  geom_sf() 

The default ggplot grey theme works well for non-spatial data, but it may not be that suitable for spatial data. We can remove it with the theme_bw() option.

# Plot the counties
ggplot(cnty) + 
  geom_sf() + 
  theme_bw()

You can modify the map projection by passing the projection type to the coord_sf function. The projection is defined using PROJ4 formatted string. For example, to project the continental US into a Lambert equal area projection, type:

# Define the projection using PROJ4 syntax
ea <- "+proj=laea +lat_0=45 +lon_0=-100 +x_0=0 +y_0=0 +a=6370997 +b=6370997 +units=m +no_defs"

ggplot(cnty) + 
  geom_sf() + 
  theme_bw() +
  coord_sf(crs=ea) 

To learn more on spatial projections in R click here.

Generating a basic map in tmap

While ggplot2 can handle basic mapping operations, it does not offer the ease and flexibility of the tmap package. Much like ggplot2, tmap adopts the same layering strategy. For example, to construct the above plot we first specify the layer to plot via the call to tm_shape, then the geometry to symbolize the spatial feature (tm_polygons()).

library(tmap)

tm_shape(cnty) +  tm_polygons()

To change the projection, pass the PROJ4 string to the projection parameter. To add a lat/long graticule, use the tm_grid() function. As with ggplot2, map elements are layered in the order in which they are called.

tm_shape(cnty, projection = ea) + 
  tm_grid(projection = "+proj=longlat", col = "grey", labels.inside.frame = FALSE) +
  tm_polygons() +
  tm_layout(outer.margins = c(.1,.1,.1,.1))

While grids are helpful in locating features on a map, they are not always necessary and can take away from the intended visual hierarchy. In what follows, we will be focusing on choropleth maps whereby the intended focus will be on the relative distribution of a variable across polygon units. We’ll therefore refrain from adding a grid to the map.

Click here to explore additional mapping options with tmap.

Joing tables to spatial objects

In the examples that follow, you will learn how to join census income data to a spatial object in one of two ways: by state and county name, and by FIPS (Federal Information Processing Standards) id.

Joining income table to the map by county and state names

The next step is to join a dataset to the counties map. We will use the income-by-gender-education dataset used earlier in the course and limit the data to median income per person (B20004001 column).

library(dplyr)
df <- read.csv("http://mgimond.github.io/ES218/Data/Income_education.csv")
df1 <- df %>% select(subregion = County, region = State, B20004001 )

Next, we need to join the census data to the county map using two columns: state name and county name (or region and subregion columns). Let’s compare the two dataframes by viewing the first few rows of each.

head(cnty)
Simple feature collection with 6 features and 1 field
geometry type:  MULTIPOLYGON
dimension:      XY
bbox:           xmin: -88.01778 ymin: 30.24071 xmax: -85.06131 ymax: 34.2686
epsg (SRID):    4326
proj4string:    +proj=longlat +datum=WGS84 +no_defs
                        geometry              ID
1 MULTIPOLYGON (((-86.50517 3... alabama,autauga
2 MULTIPOLYGON (((-87.93757 3... alabama,baldwin
3 MULTIPOLYGON (((-85.42801 3... alabama,barbour
4 MULTIPOLYGON (((-87.02083 3...    alabama,bibb
5 MULTIPOLYGON (((-86.9578 33...  alabama,blount
6 MULTIPOLYGON (((-85.66866 3... alabama,bullock
head(df1)
  subregion region B20004001
1   Autauga     al     35881
2   Baldwin     al     31439
3   Barbour     al     25201
4      Bibb     al     29016
5    Blount     al     32035
6   Bullock     al     26408

We note two problems. First, the cnty object encodes states using their full name and not their two letter abbreviation. Second, the county names in cnty are in all lower case. Before attempting to join the dataframes, we must first fix these discrepancies. We will choose to modify df so that its state and county names match those in the cnty (map) dataframe.

We will first create a state name/abbreviation look-up table using the built-in state.name and state.abb objects. Note that using these pre-existing objects will require that we add the District of Columbia to the table since it’s not present in either data objects.

st <- data.frame(region=tolower(state.name), State = tolower(state.abb)) %>% 
      bind_rows( data.frame(region="district of columbia", State="dc") ) 

Next, we join the st look-up table to df then make the additional changes needed to join the census data to the counties map. Note that we are overwriting the earlier instance of df1.

df1 <- df %>% 
  inner_join(st, by="State") %>%
  mutate(ID = paste(region,tolower(County), sep = ","))  %>%
  select(ID, B20004001 )

We can now join df1 to cnty.

cnty.df1 <- inner_join(cnty, df1, by="ID" )

Now let’s map the income distribution.

tm_shape(cnty.df1) + tm_fill(col = "B20004001", palette = "Greens") +
  tm_legend(outside = TRUE) 

You may notice that the counties map does not seem complete; most notable is the absence of all Louisiana counties. Let’s look at the rows in cnty.df1 associated with the state of Louisiana.

library(stringr)
table(str_detect(cnty.df1$ID,  "louisiana") )

FALSE 
 2971 

There are no rows returned (i.e. all cases returned FALSE). This suggests that the Louisiana counties did not properly join. Let’s compare the Louisiana county names between df1 and cnty.

df1 %>% filter(str_detect(ID,  "louisiana")) %>% head()
                           ID B20004001
1     louisiana,acadia parish     28678
2      louisiana,allen parish     29870
3  louisiana,ascension parish     43464
4 louisiana,assumption parish     34269
5  louisiana,avoyelles parish     26483
6 louisiana,beauregard parish     33983
cnty %>% filter(str_detect(ID,  "louisiana")) %>% head()
Simple feature collection with 6 features and 1 field
geometry type:  MULTIPOLYGON
dimension:      XY
bbox:           xmin: -93.74163 ymin: 29.62765 xmax: -90.63619 ymax: 31.35225
epsg (SRID):    4326
proj4string:    +proj=longlat +datum=WGS84 +no_defs
                    ID                       geometry
1     louisiana,acadia MULTIPOLYGON (((-92.61863 3...
2      louisiana,allen MULTIPOLYGON (((-92.5957 30...
3  louisiana,ascension MULTIPOLYGON (((-91.12894 3...
4 louisiana,assumption MULTIPOLYGON (((-91.23207 3...
5  louisiana,avoyelles MULTIPOLYGON (((-92.32642 3...
6 louisiana,beauregard MULTIPOLYGON (((-93.12856 3...

It turns out that Louisiana does not divide its administrative areas into counties but parishes instead. The income data (originally downloaded from the Census Bureau) follows through with that designation convention by adding the word “parish” to each of its administrative area names. We therefore need to remove all instances of parish in the subregion names associated with Louisiana. We will therefore need to recreate the df1 object as follows:

library(stringr)
df1 <- df %>% 
  inner_join(st, by="State") %>%
  mutate(ID = paste(region,tolower(County), sep = ","),
         ID = ifelse(region=="louisiana", 
                             str_replace(ID, " parish", ""), ID))  %>%
  select(ID, B20004001 )

Let’s re-join the dataframes and re-plot the map.

cnty.df1 <- inner_join(cnty, df1, by="ID" )

tm_shape(cnty.df1) + tm_fill(col = "B20004001", palette = "Greens") +
  tm_legend(outside = TRUE) 

Most of Louisiana’s parishes are now mapped, but we are still missing a few parishes as well as a few counties. This is a result of differences in spelling for two-word county and parish names. For example, df1 encodes St. Lucie county (Florida) as st. lucie whereas cnty omits the dot and encodes it as st lucie. Fixing these discrepancies will require additional labor. At this point, it may prove more fruitful to do away with state/county names as joining keys and use FIPS county codes instead. FIPS (Federal Information Processing Standards) codes assign each county/state a unique five number designation thus making it easier to join data to spatial features. However, neither our census data table nor the built-in counties map have FIPS codes. We will download another version of the income data (one that has FIPS codes). Note that FIPS codes are normally part of datasets provided by the Census Bureau. The maps package has a dataset called county.fips that can be used to match state/county names in the county map to FIPS codes.

Joining income table by FIPS code

Let’s download the FIPS version of the dataset. Note that we will not need to modify/add columns to the data since we will be joining this table to the county map using the FIPS code.

df <- read.csv("http://mgimond.github.io/ES218/Data/Income_education_with_FIPS.csv")
df1 <- df %>% select(FIPS, B20004001 )

Next, we’ll load the county.fips dataset.

# Load the county.fips dataset
data(county.fips)
head(county.fips)
  fips        polyname
1 1001 alabama,autauga
2 1003 alabama,baldwin
3 1005 alabama,barbour
4 1007    alabama,bibb
5 1009  alabama,blount
6 1011 alabama,bullock
cnty2 <- cnty %>%
         left_join(county.fips, by=c("ID" = "polyname"))

Now that the FIPS codes are part of the county map dataframe, let’s join the income data table to the county map, then map the income values.

cnty2.df1 <- inner_join(cnty2, df1, by=c("fips" = "FIPS") )

tm_shape(cnty2.df1) + tm_fill(col = "B20004001", palette = "Greens") +
  tm_legend(outside = TRUE) 

This is an improvement over the last map. But there are still a few counties missing. After a closer scrutiny of the data, it seems that the county.fips table splits some counties into two or more sub-regions. For example, Accomack county (Virginia) is split into two regions (and thus into two different names):

county.fips %>% filter(str_detect(polyname,  "virginia,accomack"))
   fips                       polyname
1 51001         virginia,accomack:main
2 51001 virginia,accomack:chincoteague

Fixing these discrepancies would require manual intervention and time. Another work around is to use the Census Bureau’s map files instead of maps’s built-in map. We will cover this in the next section.

Working with external GIS files

Reading a shapefile into R

The Census Bureau maintains its own library of administrative boundary shapefiles (shapefiles are popular map data formats). This may prove to be the best map source to use since its areal units are designed to match the records in the income data table. Loading a shapefile can easily be accomplished using the st_read() function from the sf package. One option is to download the shapefile from the Census Bureau’s website (usually in a zipped format) then unpack it in a local project folder before reading it into the current session with st_read(). But we can maximize the replicability of the workflow by writing a chunk of code that will download the zipped shapefile into a temporary directory before unzipping it and loading the shapefile into an active R session as shown below.

library(sf)

tmp <- tempdir()
link <- "http://www2.census.gov/geo/tiger/GENZ2010/gz_2010_us_050_00_20m.zip"
filename <- basename(link)
download.file(link, filename)
unzip(filename, exdir = tmp )
shapefile <- st_read(dsn = tmp, layer = tools::file_path_sans_ext(filename))
Reading layer `gz_2010_us_050_00_20m' from data source `C:\Users\mgimond\AppData\Local\Temp\Rtmp003anP' using driver `ESRI Shapefile'
Simple feature collection with 3221 features and 6 fields
geometry type:  MULTIPOLYGON
dimension:      XY
bbox:           xmin: -179.1473 ymin: 17.88481 xmax: 179.7785 ymax: 71.35256
epsg (SRID):    4269
proj4string:    +proj=longlat +datum=NAD83 +no_defs

The shapefile is unzipped into a temporary folder, tmp, then read from that same directory and stored in an sf object called shapefile. The code tools::file_path_sans_ext(filename) extracts the name of the zip file (minus the extension) which turns out to also be the name of the shapefile. Note that if the zip file was manually unpacked in a project folder then the only command that would have needed to be executed is the st_read function as in st_read("gz_2010_us_050_00_20m.shp" ).

The Census shapefile has a field called GEO_ID that includes the county FIPS codes along with other superfluous county ID values. We only need the last five digit county FIPS code so we’ll extract these codes into a new column called FIPS.

shapefile$FIPS <- as.numeric(str_sub(shapefile$GEO_ID, 10, 14))

Next we can append the income table to the shapefile object.

cnty2.cs <- shapefile %>%
            inner_join(df1, by="FIPS")

Now, we can plot the income distribution map. Note that the Census Bureau shapefile includes the 48 states, Alaska and Hawaii. If you want to limit the extent to the 48 states, define the boundary limits using the bbox parameter in the tm_shape function.

tm_shape(cnty2.cs, 
         bbox = st_bbox(c(xmin = -125, xmax = -67 , ymin = 25, ymax = 50))) +
  tm_fill(col = "B20004001", palette = "Greens", title = "Income") +
  tm_legend(outside = TRUE) 

Modifying the color scheme

The classification scheme can be customized. For example, to split the income values across six intervals (e.g. $0 to $20000, $20000 to $30000, $30000 to $50000, and $50000 to $100000) and to label the values as dollars, type the following:

tm_shape(cnty2.cs, 
         bbox = st_bbox(c(xmin = -125, xmax = -67 , ymin = 25, ymax = 50))) +
  tm_fill(col = "B20004001", palette = "Greens",
          style="fixed", breaks=c(0, 20000 , 30000, 50000, 100000 ),
          labels = c("under $20k", "$20k - $30k", "$30k - $50k", "above $50k"),
          title = "Income") +
  tm_legend(outside = TRUE) 

We might choose to map the income distribution using a divergent color scheme where we compare the counties to some overall value such as the counties’ median income value. We can use the quantile() function to break the income data into its percentiles. This will be helpful in generating symmetrical intervals about the median.

inc.qt <- quantile(df1$B20004001, probs = c(0, 0.125, .25, 0.5, .75, .875 , 1 ))

A divergent color scheme consists of two different hues whose lightness values converge towards some central value (the median income in our case). We’ll use a red-to-green color palette, RdYlGn, for this next map where red hues highlight counties below the county median income and green hues highlight counties above the county median value.

tm_shape(cnty2.cs,
         bbox = st_bbox(c(xmin = -125, xmax = -67 , ymin = 25, ymax = 50))) +
  tm_fill(col = "B20004001", palette = "RdYlGn",
          style="fixed", breaks = inc.qt,
          title = "Income") +
  tm_legend(outside = TRUE) 

The divergent map gives us a different handle on the data. Here, we can identify the poorer than average counties such as the southeast and parts of the Mississippi river region as well as the wealthier counties such as the mid-atlantic region and the west coast.