dplyr | tidyr | ggplot2 | tukeyedar |
---|---|---|---|
1.1.4 | 1.3.1 | 3.5.1 | 0.4.0 |
29 Bivariate residual-fit spread plot
William Cleveland’s residual-fit spread (RFS) plot was introduced earlier in this course in the univariate context. The plot provides a visual means of comparing the variability explained by a fitted model, such as group means, to the variability of the residuals. The RFS plot can also be applied to bivariate models, offering insight into the relationship between a model’s effects and its residual variability.
In bivariate analysis, an important objective is to identify a model that maximizes our ability to explain the variability in \(Y\). The greater the variability in \(Y\) explained by the model, the smaller the uncertainty in estimating \(Y\) for each value of \(X\). This objective translates into minimizing the spread of residuals relative to the variability captured by the model.
For example, in the figure below, the plot on the left illustrates a model that effectively explains the variability in \(Y\). Notice the relatively small residuals, represented as vertical dashed lines connecting the fitted (predicted) values (red points) to the observed values (grey points), compared to the variability explained by the model. The model accounts for a range in \(Y\) of approximately 11 units, while the residual magnitudes average about 1 unit, indicating minimal unexplained variability.
In contrast, the plot on the right shows a model with limited explanatory power. Here, the model accounts for only about 2 units of variability in \(Y\), while the residual magnitudes average approximately 4 units. This suggests that a significant portion of the variability in \(Y\) remains unexplained by \(X\).
While a scatter plot can provide insights into the model’s ability to explain variability in extreme cases, it is not well-suited for assessing performance across a broader range of scenarios. An RFS plot, by contrast, is specifically designed to evaluate this aspect, making it a more effective tool for such assessments.
29.1 Constructing a residual-fit spread plot
Consider the 1st order polynomial fitted to the miles-per-gallon (`mpg`) vs. horsepower (`hp`) scatter plot from the `mtcars` dataset:
The estimated values range from about 7 miles-per-gallon to about 26 miles-per-gallon, for an absolute range of around 19 mpg. The residuals are highlighted by the vertical dashed lines.
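The range quoted above can be checked directly from the fitted model. The sketch below assumes the line is fitted with `lm`; the object names `M`, `fit_range`, and `res` are illustrative only.

```r
# Fit the 1st order polynomial (a straight line) to mpg vs. hp
M <- lm(mpg ~ hp, data = mtcars)

# Smallest and largest estimated mpg values
fit_range <- range(fitted(M))
fit_range

# Spread covered by the estimated values
diff(fit_range)

# Residuals: observed mpg minus estimated mpg
res <- residuals(M)
range(res)
```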
An RFS plot can help us assess how much of the variability in `mpg` can be explained by the fitted model and how it compares to the residuals.
The RFS plot consists of two side-by-side quantile plots. The left plot, known as the fit-minus-mean plot, displays the quantiles of the estimated `mpg` values after subtracting the overall mean (i.e., recentering the estimated `mpg` values at 0). The right plot shows the quantiles of the residuals.
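As a rough illustration of this construction, the two panels can be assembled by hand in base R. This is a minimal sketch, not the method used to produce the chapter's figures: it assumes f-values computed as (i − 0.5)/n, and the names `fit_minus_mean`, `res`, and `fval` are made up for the example.

```r
M <- lm(mpg ~ hp, data = mtcars)

# Left panel values: estimated mpg recentered at 0, sorted
fit_minus_mean <- sort(fitted(M) - mean(mtcars$mpg))
# Right panel values: residuals, sorted
res <- sort(residuals(M))

# f-values assigned to the sorted values (assumed formula: (i - 0.5) / n)
n <- length(res)
fval <- (seq_len(n) - 0.5) / n

# Shared y-axis limits so the two spreads can be compared directly
ylim <- range(fit_minus_mean, res)
op <- par(mfrow = c(1, 2))
plot(fval, fit_minus_mean, ylim = ylim,
     xlab = "f-value", ylab = "mpg", main = "Fit minus mean")
plot(fval, res, ylim = ylim,
     xlab = "f-value", ylab = "mpg", main = "Residuals")
par(op)
```

Putting both panels on the same y-axis is what makes the comparison of spreads meaningful.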
Each point in the RFS plot can be tied back to the scatter plot. For example, the estimated `mpg` value for the largest `hp` value of 335 is 7.24 mpg and its residual is +7.75 mpg. These values are split between the fit-minus-mean quantile plot and the residuals quantile plot.
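The link between a single observation and its two plotted values can be traced with a few lines of R. The following is a sketch under the assumption that the model is the `lm` fit of `mpg` on `hp`; the names `i`, `est`, and `res` are illustrative.

```r
M <- lm(mpg ~ hp, data = mtcars)

# Row with the largest horsepower value
i <- which.max(mtcars$hp)
mtcars$hp[i]

est <- fitted(M)[i]     # estimated mpg for that car
res <- residuals(M)[i]  # its residual (observed minus estimated)

# Value plotted in the fit-minus-mean panel
est - mean(mtcars$mpg)
# Value plotted in the residuals panel
res

# The estimate and the residual add back up to the observed mpg
all.equal(unname(est + res), mtcars$mpg[i])
```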
By plotting the quantiles of both the estimated values and the residuals, the RFS plot allows for the comparison of their respective magnitudes.
The fit-minus-mean plot shows estimates covering a range of approximately 19 mpg, which aligns with the observations from the scatter plot. The residuals span a range of about 15 mpg; however, 75% of the residuals had a magnitude of 10 mpg or less. This suggests that the fitted model makes a meaningful contribution to explaining the variability in mpg.
To build an intuitive understanding of the RFS plot, consider an extreme scenario where the model is reduced to just the mean fitted to the data.
Since the mean is a constant, it explains none of the variability in `mpg`. As a result, the residuals’ quantile plot is expected to capture the entire range of variability in `mpg` (approximately 25 units), as illustrated in the following RFS plot.
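This extreme case can be checked numerically. The sketch below assumes the mean-only model is expressed as an intercept-only `lm` fit; the name `M0` is chosen for the example.

```r
# Mean-only model: the intercept is the only term
M0 <- lm(mpg ~ 1, data = mtcars)

# Every fitted value equals the overall mean, so fit-minus-mean is all zeros
range(fitted(M0) - mean(mtcars$mpg))

# The residuals carry the full spread of mpg
diff(range(residuals(M0)))
diff(range(mtcars$mpg))
```

The last two values should agree, since subtracting a constant from `mpg` shifts the values without changing their spread.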
The greater the proportion of variability explained by the fitted model relative to the residual spread, the stronger the explanatory power of the model.
29.2 Constructing an RFS plot with eda_rfs
The RFS plot can be constructed using the `eda_rfs` function from the `tukeyedar` package. It can take, as input, a model of class `lm`. For example:
# Load the package that provides eda_rfs()
library(tukeyedar)
# Fit the linear model, then pass it to eda_rfs()
M <- lm(mpg ~ hp, mtcars)
eda_rfs(M)
29.3 Constructing an RFS plot with ggplot
Some data manipulation is needed to construct an RFS plot in `ggplot`. Using the regression model `M` from the previous code chunk, the RFS dataset can be constructed as follows:
library(dplyr)
library(tidyr)
library(ggplot2)
# Assemble the residuals and the recentered fitted values
df <- data.frame(Residuals = residuals(M),
                 `Fit minus mean` = predict(M) - mean(mtcars$mpg),
                 check.names = FALSE)
# Reshape to long form, then compute each value's f-value within its group
rf <- df %>%
  pivot_longer(names_to = "type", values_to = "value", cols = everything()) %>%
  group_by(type) %>%
  arrange(value) %>%
  mutate(fval = (row_number() - 0.5) / n())
ggplot(rf, aes(x = fval, y = value)) +
geom_point(alpha = 0.3, cex = 1.5) +
facet_wrap(~ type) +
xlab("f-value") +
ylab("mpg")