This is a standalone tutorial used in one of my courses. It introduces students to Moran’s I analysis as well as some basic mapping features using tmap. This exercise also compares ArcMap’s Moran’s I results to those from a Monte Carlo simulation. For an overview of Moran’s I, see my lecture notes.
We will make use of the following packages: sf for importing the shapefiles, tmap for creating choropleth maps, and spdep for implementing the Moran’s I analysis.
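A typical session starts by loading these packages (a minimal sketch; install them first if needed):

library(sf)     # import and handle vector spatial data
library(spdep)  # neighbor definitions and Moran's I functions
library(tmap)   # choropleth mapping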
The following chunk loads the data (as an sf object) from the github site.
# Load the data (an sf object saved as an RDS file)
s <- readRDS(url("https://github.com/mgimond/Data/raw/gh-pages/Exercises/nhme.rds"))
If you want to load a shapefile from a local directory, simply use the read_sf() function instead, as in the following (the file path shown is a placeholder):
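s <- read_sf("path/to/shapefile.shp")  # substitute the path to your own shapefile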
To list the column names associated with the object’s attribute table, type:
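# One way to list the attribute (column) names
names(s)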
[1] "NAME" "STATE_NAME" "POP10_SQMI" "AVE_HH_SZ" "AVE_FAM_SZ"
[6] "Income" "House_year" "geometry"
To list the contents of an attribute, affix the dollar sign $ to the object name followed by the attribute name. For example, to list the income values, type:
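# List the income values
s$Income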
[1] 45765 59560 46559 50515 50027 40695 55046 43484 56701 60782 52393
[12] 56139 55045 70906 65226 79368 59580 56851 37378 41665 45747 44543
[23] 37110 39792 38239 42407
The Moran’s I statistic is not robust to outliers or strongly skewed datasets. It’s therefore good practice to check the distribution of the attribute values. You can plot the attribute values as follows:
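# For example, a histogram of the income values
hist(s$Income, main = NULL)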
or,
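# ...or a boxplot of the same values
boxplot(s$Income, horizontal = TRUE)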
Other than one outlier, the dataset seems to be well behaved. We’ll forgo any re-expression of the values going forward.
To symbolize the polygons using the Income attribute we will first define the classification breaks (style = quantile with n = 8 breaks) and the symbol colors (palette = "Greens"). For the latter, the tmap package makes use of Cynthia Brewer’s color schemes (see her website).
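A chunk along these lines generates the choropleth map (the legend placement is a matter of preference):

tm_shape(s) + tm_fill(col = "Income", style = "quantile", n = 8, palette = "Greens") +
  tm_legend(outside = TRUE)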
The first step in a Moran’s I analysis requires that we define “neighboring” polygons. This could refer to contiguous polygons, polygons within a certain distance, or it could be non-spatial in nature and defined by social, political or cultural “neighbors”.
Here, we’ll adopt a contiguous neighbor definition. We’ll accept any contiguous polygons that share at least one vertex; this is the “queen” case (if one chooses to adopt the chess analogy) and it’s parameterized as queen = TRUE in the call to poly2nb. If we required that just edges be shared between polygons, then we would set queen = FALSE (the rook analogy).
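Here is the corresponding call (the object name nb matches the one referenced in the next paragraph):

nb <- poly2nb(s, queen = TRUE)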
For each polygon in our shape object, nb lists all neighboring polygons. For example, to see the neighbors (by ID number) for the first polygon in the shape object, type:
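# Neighbors (by ID) of the first polygon
nb[1]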
[[1]]
[1] 2 3 6 7 20
Here’s the list of county names and associated IDs:
County | ID |
---|---|
Androscoggin | 1 |
Cumberland | 2 |
Kennebec | 3 |
Knox | 4 |
Lincoln | 5 |
Oxford | 6 |
Sagadahoc | 7 |
Waldo | 8 |
York | 9 |
Belknap | 10 |
Carroll | 11 |
Cheshire | 12 |
Grafton | 13 |
Hillsborough | 14 |
Merrimack | 15 |
Rockingham | 16 |
Strafford | 17 |
Sullivan | 18 |
Aroostook | 19 |
Franklin | 20 |
Hancock | 21 |
Penobscot | 22 |
Piscataquis | 23 |
Somerset | 24 |
Washington | 25 |
Coos | 26 |
Next, we need to assign weights to each neighboring polygon. In this example, each neighboring polygon will be assigned equal weight when computing the neighboring mean income values.
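Equal weights correspond to the "W" (row-standardized) style in nb2listw(); the call mirrors the one used in the Florida example further down:

lw <- nb2listw(nb, style = "W", zero.policy = TRUE)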
To see the weights of the first polygon’s neighbors, type:
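# Weights assigned to the first polygon's neighbors
lw$weights[1]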
[[1]]
[1] 0.2 0.2 0.2 0.2 0.2
These are the weights each neighboring income value will be multiplied by before being summed. If a polygon has 5 neighbors, each neighbor will have a weight of 1/5 or 0.2. This weight will then be used to compute the mean neighbor values as in 0.2(neighbor1) + 0.2(neighbor2) + 0.2(neighbor3) + 0.2(neighbor4) + 0.2(neighbor5)
. This is equivalent to summing all five income values then dividing by 5.
NOTE: This step does not need to be performed when running the moran or moran.test functions outlined in Steps 4 and 5. This step is only needed if you wish to generate a scatter plot between the income values and their lagged counterpart.
Next, we’ll have R compute the average neighbor income value for each polygon. These values are often referred to as spatially lagged values.
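A sketch of this step, using lag.listw() and storing the result in a vector named inc.lag (the name is our choice):

inc.lag <- lag.listw(lw, s$Income)
inc.lag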
[1] 48705.00 49551.75 45963.17 46755.50 48901.00 49748.50 50477.75
[8] 46197.17 53057.00 58061.00 52535.00 63878.50 55531.80 64396.00
[15] 63755.33 65237.33 62894.00 61829.00 39921.00 43202.75 42088.67
[22] 40291.67 40571.00 41789.83 42556.00 49377.67
You can plot the relationship between income and its spatially lagged counterpart as follows. The fitted blue line added to the plot is the result of an OLS regression model.
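One way to build the plot, fitting the OLS model as an object named M1 (referenced below); the plotting parameters are only suggestions:

# Fit an OLS model of the lagged values as a function of income
M1 <- lm(inc.lag ~ s$Income)
# Plot income against its spatially lagged counterpart, then add the fitted (blue) line
plot(inc.lag ~ s$Income, pch = 20, asp = 1, las = 1)
abline(M1, col = "blue")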
The slope of the line is the Moran’s I coefficient. You can extract its value from the model object M1 as follows:
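# The model's second coefficient is the slope
coef(M1)[2]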
s$Income
0.6827955
The Moran’s I coefficient is 0.68. The positive (upward) slope suggests that as the income value of a given polygon increases, so do those of its neighboring polygons. If the slope were negative (i.e. sloping downward), this would suggest a negative relationship whereby higher values in a given polygon would tend to be surrounded by polygons with lower income values.
The Moran’s I statistic can be computed using the moran function.
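The function needs the income values, the weights list, the number of polygons, and the global sum of weights (obtained with Szero); for example:

moran(s$Income, listw = lw, n = length(nb), S0 = Szero(lw))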
$I
[1] 0.6827955
Recall that the Moran’s I value is the slope of the line that best fits the relationship between neighboring income values and each polygon’s income in the dataset.
The hypothesis we are testing states that “the income values are randomly distributed across counties following a completely random process”. There are two methods for testing this hypothesis: an analytical method and a Monte Carlo method. We’ll explore both approaches in the following examples.
To run the Moran’s I analysis using the analytical method, use the moran.test function.
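# Analytical test; alternative = "greater" is the (one-sided) default
moran.test(s$Income, lw, alternative = "greater")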
Moran I test under randomisation
data: s$Income
weights: lw
Moran I statistic standard deviate = 5.8525, p-value = 2.421e-09
alternative hypothesis: greater
sample estimates:
Moran I statistic Expectation Variance
0.68279551 -0.04000000 0.01525284
The Moran’s I statistic is 0.683 (the same value that was computed using the moran function, as expected). The p-value is very small. When a p-value is this small, it’s common practice to report it as < 0.001.
Note that ArcMap adopts this analytical approach for its hypothesis test; however, it implements a two-sided test as opposed to the one-sided test adopted in the above example (i.e. alternative = "greater"). A two-sided p-value is nothing more than twice the one-sided p-value. Unfortunately, ArcMap does not seem to make this important distinction in any of its documentation. The distinction can have important ramifications, as shown in the next example (Florida crime data). Fortunately, the income data is so strongly clustered that both a one-sided and a two-sided test produce the same outcome (a p-value close to 0).
The analytical approach to the Moran’s I analysis benefits from being fast, but it may be sensitive to irregularly distributed polygons. A safer approach to hypothesis testing is to run an MC simulation using the moran.mc() function. The moran.mc function takes an extra argument, nsim, the number of simulations.
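A sketch of the call (the object name MC is our choice; 999 simulations match the output shown below):

MC <- moran.mc(s$Income, lw, nsim = 999, alternative = "greater")
# View the results
MC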
Monte-Carlo simulation of Moran I
data: s$Income
weights: lw
number of simulations + 1: 1000
statistic = 0.6828, observed rank = 1000, p-value = 0.001
alternative hypothesis: greater
The MC simulation generates a very small p-value, 0.001. This is not surprising given that the income values are strongly clustered. We can see the results graphically by passing the Moran’s I model to the plot function:
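# Plot the null distribution of simulated Moran's I values (MC from the sketch above)
plot(MC)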
The curve shows the distribution of Moran I values we could expect had the incomes been randomly distributed across the counties. Note that our observed statistic, 0.683, falls way to the right of the distribution suggesting that the income values are clustered (a positive Moran’s I value suggests clustering whereas a negative Moran’s I value suggests dispersion).
Now, had the Moran’s I statistic been negative (suggesting a dispersed pattern), you would probably want to set the alternative argument to less, thus giving you the fraction of simulated I values more dispersed than your observed I value.
A visual exercise that you can perform to assess how “typical” or “atypical” your pattern may be relative to a randomly distributed pattern is to plot your observed pattern alongside a few simulated patterns generated under the null hypothesis.
set.seed(131)
# Generate three random permutations of the observed income values
s$rand1 <- sample(s$Income, length(s$Income), replace = FALSE)
s$rand2 <- sample(s$Income, length(s$Income), replace = FALSE)
s$rand3 <- sample(s$Income, length(s$Income), replace = FALSE)

# Map the observed distribution alongside the three simulated ones
tm_shape(s) + tm_fill(col = c("Income", "rand1", "rand2", "rand3"),
                      style = "quantile", n = 8, palette = "Greens", legend.show = FALSE) +
  tm_facets(nrow = 1)
Can you tell the difference between our observed income distribution and those generated from a completely random process? The map on the left is our observed distribution. The three maps on the right are realizations of a completely random process.
In this example, we explore the spatial distribution of 1980 homicide rates, HR80, by county for the state of Florida using the Monte Carlo approach. The following code chunk highlights the entire workflow. Here, we’ll set the number of simulations to 9999.
set.seed(2354)
# Load the data (an sf object saved as an RDS file)
s <- readRDS(url("https://github.com/mgimond/Data/raw/gh-pages/Exercises/fl_hr80.rds"))
# Define the neighbors (use queen case)
nb <- poly2nb(s, queen=TRUE)
# Assign equal weights to each polygon's neighbors
lw <- nb2listw(nb, style="W", zero.policy=TRUE)
# Run the MC simulation version of the Moran's I test
M1 <- moran.mc(s$HR80, lw, nsim=9999, alternative="greater")
# Plot the results
plot(M1)
# Display the resulting statistics
M1
Monte-Carlo simulation of Moran I
data: s$HR80
weights: lw
number of simulations + 1: 10000
statistic = 0.13628, observed rank = 9575, p-value = 0.0425
alternative hypothesis: greater
The MC simulation generated a p-value of ~0.04, suggesting that there would be a ~4% chance of being wrong in rejecting the null hypothesis, or that there is a ~4% chance that our observed pattern is consistent with a random process (note that your simulated p-value may differ from the one shown here; the number of simulations may need to be increased to reach a more stable convergence). Recall that this is a one-sided test. ArcMap’s analytical solution adopts a two-sided test. To compare its p-value to ours, we need to divide its p-value by 2 (i.e. 0.0588 / 2), which gives us a one-sided p-value of 0.0294, about 30% smaller than our simulated p-value.
The wording adopted by ArcMap (version 10.6) under the infographic (highlighted in yellow in the above figure) is unfortunate. It seems to suggest a one-sided test by explicitly describing the nature of the pattern (i.e. clustered). A more appropriate statement would have been “There is less than a 6% likelihood that the observed pattern could be the result of random chance” (note the omission of the word clustered).
Manuel Gimond, 2019