20 Fits and residuals

dplyr	tidyr	ggplot2	lattice	tukeyedar
1.1.4	1.3.1	3.5.1	0.22.6	0.4.0

In the last chapter, we learned of the many benefits that homogeneity in the residuals can offer. Homogeneity in the residuals can also help us assess how much a grouping variable contributes to estimating a variable in a dataset. For example, can educational attainment help determine a person’s income level, or can a specific soil treatment improve crop yield?

The objective of this chapter is to evaluate whether conditioning the mean on a grouping variable explains a substantial portion of the variability in the data. We do so by comparing the spread of the fitted group means to that of the residuals using a residual-fit spread plot.

20.1 Introduction

An overarching goal in data analysis is to summarize data with mathematical expressions that effectively characterize its values. A basic example of such a summary is the overall mean. For instance, consider a dataset of 60 values (visualized in the following jitter plot). These values can be characterized by a mean, \(\mu\), of 11 units (depicted as a bisque-colored point in the plot) and the residuals, \(\epsilon\).

\[ y = \mu + \epsilon \]

While the overall mean serves as an initial estimate for \(y\), it is accompanied by some degree of uncertainty. In this case, the variability spans approximately four units.

To refine our estimate of \(y\), we can incorporate additional information, such as the group to which each measurement belongs. This allows us to account for group-level differences and reduce the uncertainty in our estimate of \(y\).

\[ y = \mu_{group} + \epsilon \]

For example, dividing the measurements into three groups provides a separate estimate, \(\mu_{group}\), for each group.

In this example, knowing the group membership of an observation produces three distinct estimates of \(y\) that range from about 10 to 12 units. This approach not only helps us home in on a more precise measure of location, but it also reduces the uncertainty in our estimate of \(y\) by about two units, an improvement over the earlier model, which relied on a single estimate for all observations.
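The comparison between the two models can be sketched in a few lines of base R. The data below are simulated for illustration only (three hypothetical groups centered on 10, 11 and 12); they are not the dataset shown in the plots, and the object names are assumptions.

```r
# Hypothetical data: 60 values in three groups of 20 (NOT the chapter's dataset)
set.seed(42)
dat <- data.frame(
  group = rep(c("A", "B", "C"), each = 20),
  y     = c(rnorm(20, mean = 10, sd = 0.5),
            rnorm(20, mean = 11, sd = 0.5),
            rnorm(20, mean = 12, sd = 0.5))
)

# Model 1: a single overall mean, y = mu + epsilon
mu_overall  <- mean(dat$y)
res_overall <- dat$y - mu_overall

# Model 2: one mean per group, y = mu_group + epsilon
mu_group  <- ave(dat$y, dat$group)   # ave() defaults to the mean; one value per row
res_group <- dat$y - mu_group

# Compare the spread (range) of the residuals under each model
spread_overall <- diff(range(res_overall))
spread_group   <- diff(range(res_group))
c(overall = spread_overall, grouped = spread_group)
```

If the grouping variable is informative, the `grouped` spread should come out noticeably smaller than the `overall` spread.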

A grouping variable’s usefulness in improving our estimate of \(y\) depends not only on its ability to provide distinct group-level estimates of the population mean but also on its potential to reduce the uncertainty surrounding those estimates. In the following jitter plot, we see an example where splitting the data into groups yields little improvement in the precision of our estimates compared to the simpler overall mean model:

Here, the improvement in \(y\) estimates is modest because the uncertainty within each group (\(\epsilon\)) remains substantial. Each group exhibits a range of approximately four units, comparable to the range of uncertainty in the overall mean model.

20.2 The residual-fit spread plot

The residual-fit spread (rfs) plot, introduced by William Cleveland, is a visualization tool designed to compare the variability explained by a fitted model (e.g., group means in univariate analysis) to the variability in the residuals. The plot is constructed as follows:

1. Fit a model \(\mu_{group}\), such as the mean or median, to each group in the data.
2. Compute residuals by subtracting the fitted values from the original values (\(\epsilon_{group} = y_{group} - \mu_{group}\)).
3. Center the fitted values (e.g., subtract the overall mean) to align both datasets around zero (\(\mu'_{group} = \mu_{group} - \mu_{overall}\)).
4. Generate side-by-side quantile plots for the centered fitted values, \(\mu'_{group}\), and the residuals, \(\epsilon_{group}\).
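The steps above can be sketched in base R as follows. This is a minimal illustration on simulated data (hypothetical groups A, B and C), not the implementation used by any particular package. The f-value formula, \((i - 0.5)/n\), is one common convention.

```r
# Hypothetical data: three groups of 20 values (NOT the chapter's dataset)
set.seed(1)
dat <- data.frame(
  group = rep(c("A", "B", "C"), each = 20),
  y     = rnorm(60, mean = rep(c(10, 11, 12), each = 20), sd = 0.5)
)

# Step 1: fit a mean to each group (use FUN = median for a more robust fit)
fit <- ave(dat$y, dat$group, FUN = mean)

# Step 2: residuals = observed value minus its group's fitted value
res <- dat$y - fit

# Step 3: center the fitted values on the overall mean
fit_centered <- fit - mean(dat$y)

# Step 4: side-by-side quantile plots on a shared y-axis
fval <- (seq_along(res) - 0.5) / length(res)   # f-values for the sorted data
ylim <- range(c(fit_centered, res))
op <- par(mfrow = c(1, 2))
plot(fval, sort(fit_centered), ylim = ylim,
     xlab = "f-value", ylab = "y", main = "Fit minus mean")
plot(fval, sort(res), ylim = ylim,
     xlab = "f-value", ylab = "y", main = "Residuals")
par(op)
```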

Creating an rfs plot does not explicitly assume that the residual distribution has a consistent shape and spread across groups, but such homogeneity does ensure that the comparison of spreads is valid.

Using the dataset introduced earlier in this chapter, we first calculate the group means of 10, 11 and 12. The data are then split into two parts: the modeled means and the residuals. For instance, an observation \(y = 11.07\) (highlighted in green in the figure below) can be decomposed as follows:

\[ 11.07 = 12 - 0.93 \]

Here, 12 is the group mean, \(\mu_{C}\), and -0.93 is the residual value \(\epsilon_{i, C}\).

The decomposed values are plotted as side-by-side quantile plots: the fitted means on the left and the residuals on the right. To facilitate a direct comparison of spreads, the fitted values are re-centered around zero by subtracting the overall mean (\(\mu_{overall} = 11\)) from each fitted value (\(\mu_{group}\)). For the highlighted observation, this yields \(\mu'_{C} = \mu_{C} - \mu_{overall} = 12 - 11 = 1\).
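Combining the two steps, the highlighted observation can be written as the sum of the overall mean, the centered group fit and the residual, using only the values quoted above:

\[ y = \mu_{overall} + \mu'_{C} + \epsilon_{i, C} = 11 + 1 - 0.93 = 11.07 \]

The rfs plot sets aside the first term and compares the spreads of the second and third.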

In the rfs plot, both spreads are shown on a shared y-axis, centered at zero. This alignment enables a straightforward comparison of variability between the fitted values and the residuals. In this example, the amount of variability captured by the fitted values (the means) is of the same order of magnitude as the variability that remains unexplained by the model (the residuals).

Given that the tails of the residual distribution can sometimes be more variable or influenced by a few extreme values, it’s best to focus the spread comparison in the rfs plot on the central bulk of the residuals, such as the range encompassing the inner 90% or 95% of the residuals. This approach provides a more stable comparison of the typical variability unexplained by the model against the variability captured by the fitted values.

20.3 Exploring extreme scenarios with an rfs plot

To build intuition for interpreting the residual-fit spread (rfs) plot, let’s consider two extreme cases.

20.3.1 Scenario 1: Maximizing group separation

In this scenario, the \(y\) values are assigned to groups so as to maximize separation between them:

The variability within each group is reduced to around 1.25 units, considerably less than the overall variability of 4 units. This dataset generates the following rfs plot:

Here, the group means explain a substantial portion of the variability in \(y\). The spread in fitted group means covers 2 units, while approximately 95% of the residuals fall within a range of 1 unit, a much smaller spread. This indicates that the grouping variable can play a key role in explaining \(y\).

20.3.2 Scenario 2: Minimizing group separation

Now consider the opposite extreme, where the grouping variable is minimally effective at explaining \(y\). The original \(y\) values remain unchanged, but the grouping assignments are restructured to minimize differences between group means:

Here, the group means differ by less than 0.3 units, whereas the residuals span three to four units. The resulting rfs plot is as follows:

In this scenario, the residuals dominate, and the grouping variable contributes little to improving our estimates of \(y\). The small spread in fitted values compared to the residuals indicates that the grouping variable has minimal impact. If we were to abandon the grouping variable and rely solely on the overall mean, the residual spread would remain nearly unchanged, as shown in the following rfs plot.
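The two scenarios can be mimicked with a short base R sketch. The values and the assignment rules below are assumptions made for illustration: cutting the sorted values into thirds is one simple way to separate the groups strongly, and dealing the sorted values out in turn is a heuristic that keeps the group means close (it is not guaranteed to be the true minimum). This is not necessarily how the chapter's figures were produced.

```r
# Hypothetical data: 60 values; only the group labels differ between scenarios
set.seed(7)
y <- rnorm(60, mean = 11, sd = 1)
n <- length(y)

# Scenario 1: strong separation, cut the sorted values into consecutive thirds
grp_max <- character(n)
grp_max[order(y)] <- rep(c("A", "B", "C"), each = n / 3)

# Scenario 2: weak separation, deal the sorted values out to A, B, C in turn
grp_min <- character(n)
grp_min[order(y)] <- rep(c("A", "B", "C"), times = n / 3)

# Spread of the group means and of the residuals for a given grouping
spreads <- function(y, g) {
  fit <- ave(y, g)   # group mean repeated for each observation
  c(fit = diff(range(fit)), residuals = diff(range(y - fit)))
}

s_max <- spreads(y, grp_max)
s_min <- spreads(y, grp_min)
rbind(max_separation = s_max, min_separation = s_min)
```

The same `y` values feed both rows of the result. Only the labels change, which is what moves variability between the fitted values and the residuals.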

20.4 Generating an rfs plot with `eda_rfs`

The rfs plot can be generated with the eda_rfs function from the tukeyedar package. But first, we’ll load the singer height data used in the previous chapters.

df <- lattice::singer

The rfs plot for the singer height data follows:

library(tukeyedar)
eda_rfs(df, height, voice.part)

In addition to generating a plot, the function will output information about the spreads in the console. It compares the range of values associated with the mid 90% of the residuals to the spread of the fitted values.

The mid 90.0% of residuals covers about 7.98 units.
The fitted values cover a range of 7.42 units, or about 93.0% of the mid 90.0% of residuals.

The reason the mid 90% of values is chosen is to prevent outliers, or extreme values, in the residuals from disproportionately exaggerating the spread in residuals. For example, you’ll note several extreme residual values above 5 inches in the working example.
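The comparison reported in the console can be approximated by hand in base R. The sketch below assumes the lattice package is installed (for the `singer` data) and uses `quantile()` with its default algorithm. That may differ from the quantile definition used internally by `eda_rfs`, so the numbers need not match the console output exactly.

```r
# Assumes the lattice package (source of the singer data) is installed
df <- lattice::singer

fit <- ave(df$height, df$voice.part)   # voice-part mean for each singer
res <- df$height - fit                 # residuals

# Range covered by the inner 90% of residuals (5th to 95th percentile)
res_mid90 <- unname(diff(quantile(res, probs = c(0.05, 0.95))))

# Range covered by the fitted values
fit_range <- diff(range(fit))

# Fitted spread as a percentage of the mid 90% residual spread
round(100 * fit_range / res_mid90, 1)
```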

To help visualize the inner 90% of values, you can add the q=TRUE argument to the function. This option generates a shaded box highlighting the inner 90% of residual values and its matching range in the Fit minus mean plot.

eda_rfs(df, height, voice.part, q = TRUE)

The spread of the fitted heights (across the voice parts) is not negligible compared to the spread of the combined residuals: the fitted means span about 93% of the range covered by the mid 90% of the residuals.

20.5 Generating an rfs plot with `ggplot`

To generate the rfs plot using ggplot2, we must first split the data into its fitted and residual components. We’ll make use of dplyr and tidyr functions to tackle this task.

library(dplyr)
library(tidyr)

rf <- df %>%
  mutate(norm = height - mean(height)) %>%   # Normalize values to global mean
  group_by(voice.part) %>% 
  mutate( Residuals  = norm - mean(norm),    # Extract group residuals
          `Fit minus mean` = mean(norm))%>%   # Extract group means
  ungroup() %>% 
  select(Residuals, `Fit minus mean`) %>% 
  pivot_longer(names_to = "type",  values_to = "value", cols=everything()) %>% 
  group_by(type) %>% 
  arrange(value) %>% 
  mutate(fval = (row_number() - 0.5) / n())

Next, we plot the data.

library(ggplot2)
ggplot(rf, aes(x = fval, y = value)) + 
  geom_point(alpha = 0.3, size = 1.5) +
  facet_wrap(~ type) +
  xlab("f-value") +
  ylab("Height (inches)")