dplyr | ggplot2 | tukeyedar |
---|---|---|
1.1.4 | 3.5.1 | 0.4.0 |
22 Univariate spread-level plot
In Chapter 19, we emphasized the importance of maintaining a homogeneous spread in residuals. One common cause of non-homogeneous spread is when the dataset exhibits a systematic change in variability as a function of location. In other words, the variability within each batch may depend on a location-based statistic, such as the mean or median. This type of dependency is often undesirable—especially in analyses like ANOVA—and should ideally be addressed.
To visualize this relationship, a spread-level plot (s-l plot), sometimes referred to as a spread-location plot, is highly effective. This plot helps identify patterns or trends in variability that may need correction before proceeding with further analysis.
22.1 Constructing the (univariate) s-l plot
The s-l plot visualizes the relationship between each batch's residuals and its location (typically the median, due to its robustness against outliers). The residuals are expressed as the square root of their absolute value.
For example, the following is an s-l plot of petal length vs species from the iris dataset.
The red line in the plot connects the median values of each batch of residuals. It helps identify the type of relationship between spread and location. If the line increases monotonically, spread increases with location; if it decreases monotonically, spread decreases with location; and if it does neither, there is no systematic change in spread as a function of location.
The s-l plot of the iris dataset suggests that the residuals increase monotonically with an increase in the fitted median value. Note that the x-axis is not categorical; it shows the median value for each species group.
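This reading of the plot can be checked numerically. The following is a minimal sketch in base R, not the code used to produce the figure; the object names `loc`, `res` and `sl_check` are illustrative. It computes each species' median petal length, the square root of the absolute residuals, and the median of those residuals for each species (the values that the red line connects).

```r
# Location: the median petal length of each observation's species
loc <- ave(iris$Petal.Length, iris$Species, FUN = median)

# Spread: square root of the absolute residuals
res <- sqrt(abs(iris$Petal.Length - loc))

# One row per species: its median (location) and its median spread
sl_check <- data.frame(
  location = sapply(split(iris$Petal.Length, iris$Species), median),
  spread   = sapply(split(res, iris$Species), median)
)
sl_check
```

Because the species are listed in order of increasing median, a spread that grows with location shows up as a `spread` column that increases down the rows.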
Though a boxplot is not as effective as the s-l plot at highlighting heterogeneity in spread, the increase in spread can also be observed in a boxplot of the data.
Here, the increasing width of the IQRs is also noticeable.
Increasing or decreasing spreads as a function of fitted value can be corrected by re-expressing the continuous variable. For example, applying a log transformation helps stabilize the spread of the residuals, as can be seen in the following s-l plot.
Note that the re-expression of the petal length values changes the group median values but not the rank of species.
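A quick way to confirm this is to compare the species medians before and after the log re-expression. This is a small base R sketch with illustrative object names (`med_raw`, `med_log`).

```r
# Species medians on the original scale and on the log scale
med_raw <- sapply(split(iris$Petal.Length, iris$Species), median)
med_log <- sapply(split(log(iris$Petal.Length), iris$Species), median)

# Side-by-side comparison of the two sets of medians
rbind(raw = med_raw, log = med_log)

# Check whether the species are ordered the same way on both scales
same_order <- identical(order(med_raw), order(med_log))
same_order
```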
22.2 Creating an s-l plot using eda_sl
An s-l plot can be generated using tukeyedar’s eda_sl function.
library(tukeyedar)
eda_sl(iris, Petal.Length, Species)
As with many tukeyedar functions used in the course, the power transformation applied to the data is shown in the upper right-hand corner of the plot. By default, the power transformation is 1 (i.e., the data are untransformed).
The function allows you to apply the transformation without needing to do so outside of the function. By default, the Box-Cox transformation method is adopted. To adopt the Tukey method, set tukey = TRUE.
In the following code block, we apply a power transformation of 0 (the log transformation) to the data.
eda_sl(iris, Petal.Length, Species, p = 0)
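To make the two transformation families mentioned above concrete, here is a minimal sketch of their textbook definitions. The helper names `box_cox` and `tukey_pow` are hypothetical and are not part of tukeyedar; the package's internal implementation may differ in its details.

```r
# Box-Cox power transformation: (x^p - 1) / p, with log(x) when p = 0
box_cox <- function(x, p) {
  if (p == 0) log(x) else (x^p - 1) / p
}

# Tukey power transformation: x^p for p > 0, log(x) for p = 0,
# and -(x^p) for p < 0 so that the ordering of the values is preserved
tukey_pow <- function(x, p) {
  if (p == 0) log(x) else if (p > 0) x^p else -(x^p)
}

x <- c(1, 2, 4)
box_cox(x, 0)      # identical to log(x)
box_cox(x, 0.5)
tukey_pow(x, -1)
```

Both families reduce to the log when the power is 0, which is why `p = 0` yields the log transformation regardless of the method chosen.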
22.3 Creating an s-l plot with ggplot
The following code block demonstrates the steps required to create an s-l plot in ggplot.
library(dplyr)
library(ggplot2)

# Compute each species' median and the square root of the absolute residuals
res.sq <- iris %>%
  group_by(Species) %>%
  mutate(Median   = median(Petal.Length),
         Residual = sqrt(abs(Petal.Length - Median)))

# Plot spread vs. location; the red line connects each batch's median spread
ggplot(res.sq, aes(x = Median, y = Residual)) +
  geom_jitter(alpha = 0.4, width = 0.05, height = 0) +
  stat_summary(fun = median, geom = "line", col = "red") +
  ylab("Spread") +
  geom_text(aes(x = Median, y = 1.25, label = Species)) +
  xlim(1, 6.5)
Note that if you need to rescale the y-axis when using the stat_summary() function, you should use the coord_cartesian(ylim = c( .. , .. )) function instead of the ylim() function. The latter removes the values that fall outside of its range before stat_summary() computes the medians; the former only zooms the view and leaves the data untouched.
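The difference can be seen on a small made-up dataset. The sketch below is illustrative only: the data frame `d` and the limits `c(0, 4)` are invented for the example. It uses ggplot2's `layer_data()` to extract the medians that `stat_summary()` computes under each approach.

```r
library(ggplot2)

# Toy data: two batches, each with values above the intended y-axis limit of 4
d <- data.frame(g = rep(c(1, 2), each = 5),
                y = c(1, 2, 3, 4, 5, 1, 2, 3, 4, 100))

base <- ggplot(d, aes(x = g, y = y)) +
  stat_summary(fun = median, geom = "line", col = "red")

# coord_cartesian() zooms the view; medians are computed from all values
zoomed <- layer_data(base + coord_cartesian(ylim = c(0, 4)))$y

# ylim() drops out-of-range values first, so the medians change
# (ggplot2 warns about the removed rows; the warning is silenced here)
clipped <- suppressWarnings(layer_data(base + ylim(0, 4)))$y

rbind(zoomed, clipped)
```

With `ylim()`, the values 5 and 100 are discarded before the medians are computed, so each batch's median is based on four values instead of five.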
22.4 A variation of the s-l plot
Another version of the s-l plot pits the log of the interquartile spread against the log of the median. This approach only works for positive values, so the data may need to be adjusted so that the minimum value is greater than 0.
This variation of the s-l plot is appealing in that the slope of the best fit line can suggest a power transformation via \(power = 1 - slope\).
This variant of the s-l plot can be implemented in the eda_sl function by setting the argument type = "level".
eda_sl(iris, Petal.Length, Species, type = "level",
       loess.d = list(degree = 1, span = 1.5))
Slope = 1.051237
Note how this plot differs from our earlier s-l plot in that we are only displaying each batch’s median spread value and we are fitting a straight line to the medians instead of connecting them.
The function will return the slope of the line in the console window. Here, the slope is 1.05, which suggests a power of \(1 - 1.05 = -0.05\). This is a power transformation very close to the log transformation used earlier in this chapter.
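The \(power = 1 - slope\) rule can also be worked through by hand. The following base R sketch is an approximation under a stated assumption: it fits an ordinary least-squares line with `lm()` for simplicity, which is not necessarily the fitting method used by eda_sl, so its slope may differ somewhat from the value reported above. The object names are illustrative.

```r
# Level: log of each species' median; spread: log of each species' IQR
level  <- log(sapply(split(iris$Petal.Length, iris$Species), median))
spread <- log(sapply(split(iris$Petal.Length, iris$Species), IQR))

# Ordinary least-squares line through the three (level, spread) points
# (assumption: a simple stand-in for the line fitted by eda_sl)
fit   <- lm(spread ~ level)
slope <- unname(coef(fit)["level"])

# Suggested power transformation
power <- 1 - slope
c(slope = slope, power = power)
```

A slope near 1 gives a power near 0, which points to the log transformation.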
A ggplot implementation of this variant of the s-l plot is shown next:
# Compute the log of the median (level) and the log of the IQR (spread) per species
sl <- iris %>%
  group_by(Species) %>%
  summarise(level  = log(median(Petal.Length)),
            IQR    = IQR(Petal.Length),  # Computes the interquartile range
            spread = log(IQR))

# Plot spread vs. level and fit a robust regression line (requires the MASS package)
ggplot(sl, aes(x = level, y = spread)) +
  geom_point() +
  stat_smooth(method = MASS::rlm, se = FALSE) +
  xlab("Location") + ylab("Spread") +
  geom_text(aes(x = level, y = spread, label = Species), size = 2.5)