22 Univariate spread-location and spread-level plots

dplyr	ggplot2	tukeyedar
1.1.4	3.5.1	0.4.0

In Chapters 19 and 20, we emphasized the many advantages of maintaining a homogeneous spread in residuals. A common cause of non-homogeneous spread is a systematic change in variability as a function of location. In other words, the variability within each batch may depend on a location-based statistic, such as the mean or median. This type of dependency is often undesirable, especially in analyses like ANOVA, and should ideally be addressed.

A spread-location (s-l) plot is an effective way to visualize this relationship. It helps identify patterns or trends in variability that may need correction before proceeding with further analysis.

22.1 The spread-location plot

The s-l plot visualizes the relationship between the spread of the residuals and the location of each batch of data. The location is typically the median, given its robustness against outliers. The spread and location are computed as:

\[ spread_{i,grp} = \sqrt{|y_{i,grp} - median(y_{grp})|} \]

\[ location_{grp} = median(y_{grp}) \]
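The two definitions above can be computed directly in base R. This is a minimal sketch rather than the code used to draw the plots in this chapter; the object names `location`, `spread`, and `sl.dat` are chosen here for illustration only.

```r
# Location: the median petal length of each species, repeated for every row
location <- ave(iris$Petal.Length, iris$Species, FUN = median)

# Spread: square root of the absolute residual about the group median
spread <- sqrt(abs(iris$Petal.Length - location))

# One row per observation, ready to be plotted as spread vs. location
sl.dat <- data.frame(Species = iris$Species, location, spread)
head(sl.dat)
```

Each observation contributes one spread value, while all observations in a batch share the same location value.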

For example, the following is an s-l plot of petal length vs species from the iris dataset.

The red line in the plot connects the median spread values of each batch. It helps identify the type of relationship between spread and location. If the line increases monotonically, spread increases with location; if it decreases monotonically, spread decreases with location; and if it is neither increasing nor decreasing monotonically, there is no systematic change in spread as a function of location.

The s-l plot of the iris dataset suggests that the spread increases monotonically with the fitted median value. Note that the x-axis is not categorical; it shows the median value of each species group.

Though a boxplot is not as effective as the s-l plot at highlighting heterogeneity in spread, the increase in spread can sometimes be observed in a boxplot of the data, as shown in the following plot.

Here, the increasing width of the IQRs is noticeable.
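A boxplot of this kind, along with the IQR of each batch, can be produced with base R. This is a minimal sketch and not necessarily how the figure in this chapter was generated; the object names `med` and `iqr` are illustrative.

```r
# Side-by-side boxplots of petal length, one box per species
boxplot(Petal.Length ~ Species, data = iris, ylab = "Petal length")

# Median and interquartile range of each batch
med <- tapply(iris$Petal.Length, iris$Species, median)
iqr <- tapply(iris$Petal.Length, iris$Species, IQR)

# IQRs listed in order of increasing batch median
iqr[order(med)]
```

Listing the IQRs in order of increasing median gives a numerical check on what the box heights suggest visually.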

Increasing or decreasing spreads as a function of fitted value can be corrected by re-expressing the continuous variable. For example, applying a log transformation helps stabilize the residuals of the petal length dataset as can be seen in the following s-l plot.

Note that the re-expression of the petal length values changes the group median values but not the rank of species.
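This can be checked with a short base R sketch that compares the group medians before and after the log re-expression. The object names `med.raw` and `med.log` are illustrative.

```r
# Group medians on the original scale and on the log scale
med.raw <- tapply(iris$Petal.Length, iris$Species, median)
med.log <- tapply(log(iris$Petal.Length), iris$Species, median)

# The values differ between the two rows ...
rbind(raw = med.raw, log = med.log)

# ... but the ordering of the species is the same on both scales
identical(order(med.raw), order(med.log))
```

Because the log is a monotonically increasing function, it preserves the ordering of the batch medians.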

22.2 Creating an s-l plot using `eda_sl`

An s-l plot can be generated using tukeyedar’s eda_sl function.

library(tukeyedar)
eda_sl(iris, Petal.Length, Species)

As with many tukeyedar functions used in the course, the power transformation applied to the data is shown in the upper right-hand corner of the plot. By default, the power transformation is 1 (i.e. the data are untransformed).

The function can apply the transformation for you, so there is no need to transform the data beforehand. By default, the Box-Cox transformation method is adopted. To adopt the Tukey method, set tukey = TRUE.
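To make the distinction between the two methods concrete, the sketch below implements the commonly cited textbook forms of each re-expression. The helper names `box_cox()` and `tukey_power()` are hypothetical and written for illustration; the forms are an assumption and tukeyedar's internal implementation may differ in its details.

```r
# Hypothetical helper: Box-Cox form, (x^p - 1) / p, with log(x) at p = 0
box_cox <- function(x, p) {
  if (p == 0) log(x) else (x^p - 1) / p
}

# Hypothetical helper: simple power form, x^p, with log(x) at p = 0
tukey_power <- function(x, p) {
  if (p == 0) log(x) else x^p
}

x <- c(1, 2, 4, 8)
box_cox(x, 0.5)      # rescaled and shifted square root
tukey_power(x, 0.5)  # plain square root
box_cox(x, 0)        # both forms reduce to log(x) when p = 0
```

In these forms, the two methods differ only by a shift and a rescaling for a given non-zero power, so the shape of the re-expressed batch is the same.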

In the following code block, we apply a power transformation of 0 (the log transformation) to the data.

eda_sl(iris, Petal.Length, Species, p = 0)

22.3 Creating an s-l plot with `ggplot`

The following code block demonstrates the steps required to create an s-l plot in ggplot.

library(dplyr)
library(ggplot2)

# Compute each species' median and the root-absolute residuals about it
res.sq <- iris %>% 
  group_by(Species) %>% 
  mutate(Median   = median(Petal.Length),
         Residual = sqrt(abs(Petal.Length - Median)))

# Plot spread vs. location; the red line connects the median spread of each batch
ggplot(res.sq, aes(x = Median, y = Residual)) + 
  geom_jitter(alpha = 0.4, width = 0.05, height = 0) +
  stat_summary(fun = median, geom = "line", col = "red") +
  ylab("Spread") +
  geom_text(aes(x = Median, y = 1.25, label = Species)) +
  xlim(1, 6.5)

Note that if you need to rescale the y-axis when using stat_summary(), you should use coord_cartesian(ylim = c( .. , .. )) instead of ylim(). The latter removes values outside its range before stat_summary() computes the summary, whereas the former only zooms the view and leaves the underlying data untouched.
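The difference can be seen with a small toy example. This is a sketch using made-up data (`df`) and illustrative object names; it relies on ggplot2's `layer_data()` to inspect the value that `stat_summary()` would draw.

```r
library(ggplot2)

# Toy data: a single batch containing one large value
df <- data.frame(x = 1, y = c(1, 2, 3, 10))

base <- ggplot(df, aes(x, y)) +
  stat_summary(fun = median, geom = "point", col = "red")

# ylim() discards y = 10 (with a warning) before the median is computed
p.ylim <- base + ylim(0, 5)

# coord_cartesian() keeps all four values and only zooms the view
p.zoom <- base + coord_cartesian(ylim = c(0, 5))

# Summary value computed for each plot
layer_data(p.ylim)$y
layer_data(p.zoom)$y
```

The first value should be the median of 1, 2 and 3 only, while the second should be the median of all four values.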

22.4 The spread-level plot

A variation of the spread-location plot is the spread-level plot, which plots the log of the interquartile range against the log of the median for each group.

\[ spread_{grp} = \log(IQR(y_{grp})) \]

\[ level_{grp} = \log(median(y_{grp})) \]

This approach only works for values greater than zero, which may require shifting the data so that the minimum value is greater than 0.

This version of the s-l plot is appealing in that the slope of the best fit line can suggest a power transformation via \(power = 1 - slope\).
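The rule can be motivated with a short, informal argument. The sketch below assumes that the spread is proportional to a power of the level and uses a first-order approximation; the symbols \(a\), \(b\), \(c\), \(m\) and \(p\) are introduced here for illustration only. If the fitted line has slope \(b\), then on the original scale

\[ \log(spread) = a + b \, \log(m) \quad \Rightarrow \quad spread \approx c \, m^{b} \]

where \(m\) is the batch median and \(c = e^{a}\). Re-expressing the data with the power \(p\) rescales the spread near \(m\) by roughly the derivative of \(y^{p}\) evaluated at \(m\):

\[ spread_{new} \approx |p| \, m^{p-1} \times c \, m^{b} = c \, |p| \, m^{p - 1 + b} \]

The new spread no longer depends on \(m\) when the exponent is zero, which gives

\[ p - 1 + b = 0 \quad \Rightarrow \quad power = 1 - slope \]

For example, a slope of 1 gives a power of 0, which corresponds to the log re-expression.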

This variant of the s-l plot can be implemented in the eda_sl function by setting the argument type = "level".

eda_sl(iris, Petal.Length, Species, type = "level",
       loess.d = list(degree = 1, span = 1.5))

Slope =  1.051237

Note how this plot differs from our earlier s-l plot: we display only a single spread value for each batch, and we fit a straight line to the points instead of connecting them.

The function prints the slope of the line to the console. Here, the computed slope is 1.05, which suggests a power of \(1 - 1.05 = -0.05\). This is a power transformation very close to the log transformation used earlier in this chapter.
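The arithmetic can also be traced by hand in base R. The sketch below fits an ordinary least-squares line with `lm()` to the log IQR and log median of each species. This is an assumption made for illustration: `eda_sl` may measure spread and fit the line differently, so the slope computed here need not match the value it reports.

```r
# Level and spread of each species on the log scale
level  <- log(as.numeric(tapply(iris$Petal.Length, iris$Species, median)))
spread <- log(as.numeric(tapply(iris$Petal.Length, iris$Species, IQR)))

# Ordinary least-squares line through the three points
fit   <- lm(spread ~ level)
slope <- unname(coef(fit)[2])

# Suggested power transformation
power <- 1 - slope
c(slope = slope, power = power)
```

A slope near 1 yields a power near 0, which points to the log re-expression.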

A ggplot implementation of this variant of the s-l plot is shown next:

# One row per species: log of the median (level) and log of the IQR (spread)
sl <- iris %>%
  group_by(Species) %>%
  summarise(level  = log(median(Petal.Length)),
            IQR    = IQR(Petal.Length),  # Computes the interquartile range
            spread = log(IQR))

# Plot spread vs. level and fit a robust line (requires the MASS package)
ggplot(sl, aes(x = level, y = spread)) + geom_point() + 
  stat_smooth(method = MASS::rlm, se = FALSE) +
  xlab("Location") + ylab("Spread") +
  geom_text(aes(x = level, y = spread, label = Species), cex = 2.5)
