26  Model residuals

This chapter uses ggplot2 version 3.5.1.

In our earlier exploration of univariate analysis, we learned how residuals could provide insights into the underlying patterns and processes. Examining the relationship between residuals and fitted models proved essential for answering many exploratory data analysis (EDA) questions and refining our understanding of the data’s structure.

Residuals play an equally pivotal role in bivariate analysis. At their core, residuals help evaluate and refine model fitting strategies. Beyond this, they are indispensable in confirmatory analysis, where they serve as tools to assess the significance and predictive power of an independent variable in explaining variations in a continuous dependent variable. In this way, residuals bridge the gap between model diagnostics and the broader goals of statistical inference.

In this chapter, we will explore ways residuals can help fine-tune the formulation and fitting of models.

26.1 Residual-dependence plot

Regardless of the type of model, it is uncommon for a model to fit every data point perfectly, even with non-parametric approaches like loess. In fact, achieving a perfect fit can be undesirable when the goal is to gain insight into the underlying process, since noise is often an inherent byproduct of measurement.

The residuals reflect the difference between the observed value, \(y_{observed}\), and what the model predicts, \(y_{predicted}\):

\[ \epsilon = y_{observed} - y_{predicted} \]

In the miles-per-gallon vs. horsepower dataset explored in the previous chapter, several models were fitted, including the first-order polynomial model:

\[ mpg = a + b(hp) + \epsilon \]

The lm() function is used to define the model:

M1 <- lm(mpg ~ hp, mtcars)

The M1 model is a list that stores many model components, including the residuals, which can be extracted from the model using the residuals() function. The output is a numeric vector that can be added to the data frame as follows:

mtcars$residuals1 <- residuals(M1)
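As a quick sanity check, the extracted residuals match the definition given above: the observed values minus the model's fitted values. A minimal sketch:

```r
M1 <- lm(mpg ~ hp, mtcars)

# residuals() returns the observed mpg minus the fitted values
all.equal(unname(residuals(M1)), mtcars$mpg - unname(fitted(M1)))
#> TRUE
```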

The residuals can be explored in a residual-dependence plot (R-D plot), where the residuals are mapped to the y-axis and the independent variable is mapped to the x-axis. An example of the plot is generated using ggplot2. To help identify a pattern (if any), a loess curve is fitted to the residuals:

library(ggplot2)

ggplot(mtcars, aes(x = hp, y = residuals1)) + geom_point() +
  stat_smooth(method = "loess", se = FALSE, span = 1,
              method.args = list(degree = 1))

A well-defined model will produce a residual point pattern that appears uniformly distributed across the range of x values. The fitted loess should show no prominent trend in the residuals.

The first-order polynomial residuals follow a parabolic pattern, suggesting that the linear model did not capture the true underlying relationship. We say that the residuals show dependence on the x values.

Next, we’ll look at the residuals from the second-order polynomial model, M2.

M2 <- lm(mpg ~ hp + I(hp^2), mtcars)

mtcars$residuals2 <- residuals(M2)

ggplot(mtcars, aes(x = hp, y = residuals2)) + geom_point() +
  stat_smooth(method = "loess", se = FALSE, span = 1,
              method.args = list(degree = 1))

The parabolic pattern observed in the first-order model residuals is no longer present in this set of residuals. This suggests that the second-order model does a better job in capturing the pattern generated by this bivariate dataset.
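One informal way to compare the two fits numerically is the spread of their residuals. The sketch below uses sigma(), which reports a model's residual standard error; a smaller value indicates a tighter fit:

```r
M1 <- lm(mpg ~ hp, mtcars)
M2 <- lm(mpg ~ hp + I(hp^2), mtcars)

# Residual standard error of each model; the second-order model
# should yield the smaller value for this dataset
sigma(M1)
sigma(M2)
```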

26.2 Residual-fit plot

A residual-fit plot (R-F plot) is a type of residual plot where the modeled (or fitted) values, \(y_{predicted}\), are plotted on the x-axis, and the residuals, \(\epsilon\), are plotted on the y-axis. Similar to the residual-dependence plot, the R-F plot visually conveys how closely the observed points align with the modeled relationship.

mtcars$fit2 <- fitted(M2)

ggplot(mtcars, aes(x = fit2, y = residuals2)) + geom_point() +
  stat_smooth(method = "loess", se = FALSE, span = 1,
              method.args = list(degree = 1)) +
  xlab("Fitted values")

The distinction between an R-F plot and an R-D plot is often minimal, especially when the fitted model is a first-order polynomial. For instance, if the modeled slope is positive, both plots will exhibit nearly identical point patterns. In the following plots, points are colored consistently across the three visualizations to highlight their alignment:

As shown, the point patterns in both residual plots are identical.

If the modeled slope is negative, the point pattern in the R-F plot becomes a mirror image of the R-D plot. This is evident in the re-arrangement of the colored points in the R-F plot:

Despite the flipped orientation, the insights derived from either plot, such as potential model adjustments, remain consistent.
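The mirror-image relationship can be verified directly: with a negative slope, the fitted values are a decreasing linear function of \(x\), so plotting residuals against fitted values reverses the horizontal ordering of the points. A small sketch using simulated data (the values here are made up for illustration):

```r
set.seed(1)
x <- 1:30
y <- 10 - 0.5 * x + rnorm(30)  # simulated data with a negative slope
M <- lm(y ~ x)

# Fitted values decrease as x increases, so the R-F plot is a
# left-to-right flip of the R-D plot
cor(x, fitted(M))
#> [1] -1
```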

When the fitted model is curvilinear, however, the differences between the two residual plots become more pronounced and potentially meaningful. This is demonstrated in the following series of plots:

Both residual plots suggest that the second-order polynomial model is a good fit for the data. However, the point patterns in the two plots differ significantly, as highlighted by the reorganization of the colored points in the R-F plot.

One notable difference is the clearer visibility of heteroskedasticity in the R-D plot: the increasing spread of residuals with larger \(x\) values is more evident there, while it is less noticeable in the R-F plot. Heteroskedasticity will be discussed in detail in the next chapter, along with techniques and plots designed to better identify and analyze it.

26.3 Where are the residuals, exactly?

You might wonder, “How can we determine the spread along the fitted line when, for any given value of \(X\), there may be at most a single observation?” To address this, it’s important to recognize that most datasets represent a finite set of observations. However, when fitting a model to bivariate data, we implicitly assume that the relationship between \(X\) and \(Y\) is continuous within the range of observed values. This assumption allows us to generalize beyond the specific data points collected.

For example, when modeling the relationship between miles-per-gallon and horsepower, our analysis isn’t restricted to just the vehicles in the dataset. Instead, we aim to capture a broader relationship that applies to all vehicles. The expectation is that if a new vehicle were engineered to produce a given horsepower, the fitted model could estimate its expected fuel efficiency, even though this specific vehicle wasn’t part of the original dataset.

This concept of spread reflects the variability we would observe if multiple values of \(Y\) were recorded for the same \(X\). While the model provides an expected value for \(Y\), real-world observations rarely align perfectly with this expectation. For instance, if 10 vehicles were designed to produce the same horsepower, their observed fuel efficiency would likely vary due to factors like manufacturing differences or environmental conditions. This variability means \(Y\) is treated as a “random variable,” subject to influences beyond the fitted model.

In practice, we rarely have the luxury of observing multiple \(Y\)-values for the same \(X\), let alone for many \(X\)-values across the range. Instead, we rely on the distribution of residuals to approximate this variability. The spread of residuals across the range of \(X\) serves as a key indicator of the overall error in the model and helps us assess the consistency of the relationship it describes.

This is why the assumption of uniform variability (or homoscedasticity) is so important: it allows us to use all residuals to estimate a single measure of dispersion. By pooling information from the entire range of \(X\), we increase the reliability of this estimate, enhancing our ability to draw meaningful conclusions about the model’s performance.
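Under homoscedasticity, the pooled residuals yield a single dispersion estimate: the root mean squared residual, adjusted for the number of fitted parameters. This is the quantity lm() reports as the residual standard error. A sketch of the computation:

```r
M2 <- lm(mpg ~ hp + I(hp^2), mtcars)
e  <- residuals(M2)

# Pooled estimate of spread across the full range of x:
# residual sum of squares divided by the residual degrees of freedom
sqrt(sum(e^2) / df.residual(M2))

# Matches the model's reported residual standard error
sigma(M2)
```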