31  Resistant lines

This chapter makes use of the following R packages: dplyr (1.1.4), ggplot2 (3.5.1) and MASS (7.3-61).

31.1 Introduction

Ordinary least squares regression lines (those created using the lm() function) are known to be highly sensitive to outliers. This sensitivity arises because OLS regression minimizes the sum of squared residuals, which gives disproportionate weight to data points with large deviations from the predicted line. Consequently, even a single outlier can significantly distort the slope and intercept of the best-fit line.
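In other words, OLS chooses the intercept \(a\) and slope \(b\) that minimize

\[ \sum_{i=1}^{n} (y_i - a - b x_i)^2 \]

so a point that deviates strongly from the rest contributes the square of its deviation to the quantity being minimized.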

This vulnerability is tied to the use of the mean, a central measure that is not robust to outliers. Specifically, the breakdown point of OLS regression is \(1/n\), meaning that the presence of just one outlier in a dataset with \(n\) points can greatly influence the regression results.
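A quick illustration with made-up numbers shows this sensitivity of the mean: shifting a single value barely affects the median but can move the mean substantially.

# A single aberrant value drags the mean but leaves the median unchanged
z <- 1:11
mean(z)       # 6
median(z)     # 6

z[11] <- 100  # replace the last value with an outlier
mean(z)       # 14.09
median(z)     # 6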

To illustrate this, consider a simple dataset where all points are perfectly aligned along a line. By fitting this dataset with an OLS regression line, we can observe its behavior in an ideal scenario. Then, by introducing an outlier, we can see how drastically the line changes. This demonstrates the need for regression techniques that are more resistant to outliers.

x <- 1:11
y <- 5 + 2 * x
plot(y ~ x, pch = 20)
M <- lm(y ~ x)
abline( M , col = "red")

As expected, we have a perfect fit, and the regression model’s coefficients match those used to create the data.

coef(M)
(Intercept)           x 
          5           2 

Note what happens to the fitted line when a single point deviates significantly from the linear pattern.

y[11] <- 2
plot(y ~ x, pch = 20)
M <- lm(y ~ x)
abline( M , col = "red")

The model’s intercept and slope are now:

coef(M)
(Intercept)           x 
  9.5454545   0.8636364 

The slope dropped from 2 to 0.86 because of a single point!

When our goal is to understand the trend of the data and what the majority of points reveal about the phenomenon under investigation, we cannot afford to let a few “maverick” values dominate the analysis. Outliers, while sometimes interesting, can distort results in ways that obscure the broader patterns. To address this, we need fitting tools that reduce or eliminate the undue influence of such outliers.

Fortunately, several methods are available for this purpose. Among them, Tukey’s resistant line and the bisquare robust estimation are alternatives to traditional regression techniques. These methods will be explored here to demonstrate their effectiveness in focusing on the “bulk” of the data while handling outliers gracefully.


31.2 Robust lines


31.2.1 Tukey’s resistant line

The idea is simple in concept: divide the dataset into three approximately equal groups of values (ordered along the x-axis) and summarize each group by its median. The batch medians are then used to compute the slope and intercept. The motivation behind this approach is to let a three-point summary provide a robust assessment of the type of relationship between the two variables. Let’s look at an example of its implementation, continuing with the dataset used in the previous section. First, we divide the points into three approximately equal batches.

Next, we compute the median x and y values for each batch. The medians are presented as red points in the following plot.

The x median values are 2.5, 6, 9.5 and the y median values are 10, 17, 22 respectively.
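These batch medians can be reproduced with a few lines of R. The following sketch assumes the 11-point x and y vectors created in the introduction and splits them into batches of 4, 3 and 4 points (the grouping vector grp is introduced here purely for illustration).

# Split the x-ordered points into left, middle and right batches of 4, 3 and 4
grp <- rep(c("l", "m", "r"), times = c(4, 3, 4))

# Median x and y values for each batch
xmed <- tapply(x, grp, median)
ymed <- tapply(y, grp, median)
xmed   # 2.5  6.0  9.5
ymed   # 10   17   22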

The two end medians are used to compute the slope as:

\[ b = \frac{y_r - y_l}{x_r-x_l} \] where the subscripts \(r\) and \(l\) reference the median values for the right-most and left-most batches.

Once the slope is computed, the intercept can be computed as:

\[ a = median(y_i - b * x_i) \]

where the median is computed across all \((x_i, y_i)\) points in the dataset.
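Using the xmed and ymed batch medians from the previous sketch, the first-pass slope and intercept can be reproduced as follows:

# Slope from the right-most and left-most batch medians: (22 - 10) / (9.5 - 2.5)
b <- (ymed[["r"]] - ymed[["l"]]) / (xmed[["r"]] - xmed[["l"]])
b   # 1.714

# Intercept: median of y - b * x across all points
a <- median(y - b * x)
a   # 6.429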

This is not bad for a first iteration, but the outlier’s influence is still present given that the line does not pass through the points that fall along the original linear pattern. Recall that the goal is to prevent a lone observation from influencing the fitted line.

From this first iteration, we can extract the residuals. A line is then fitted to the residuals following the same procedure outlined above.

The initial model’s intercept and slope are 6.429 and 1.714 respectively, and the intercept and slope of the line fitted to the residuals are -1.224 and 0.245 respectively. The residual slope is added to the initial slope, giving an updated slope of 1.959, and the process is repeated, thus generating the following tweaked slope and updated residuals:
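This first polishing step can be sketched as follows, reusing grp, a and b from the previous sketches (res_line() is a made-up helper, not a base R function):

# Resistant intercept and slope of a variable v regressed against x
res_line <- function(x, v, grp) {
  xmed <- tapply(x, grp, median)
  vmed <- tapply(v, grp, median)
  b <- (vmed[["r"]] - vmed[["l"]]) / (xmed[["r"]] - xmed[["l"]])
  a <- median(v - b * x)
  c(intercept = a, slope = b)
}

r   <- y - (a + b * x)        # residuals from the first fit
adj <- res_line(x, r, grp)    # intercept ~ -1.224, slope ~ 0.245
b   <- b + adj[["slope"]]     # updated slope ~ 1.959
a   <- median(y - b * x)      # intercept recomputed with the updated slope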

The slope of the second set of residuals is added to the previously computed slope, giving an updated slope of 1.994. The iterations continue until the fitted line’s residuals stabilize. Each successive iteration is shown in the following plot, with the final line in dark red.

Three iterations were required to stabilize the residuals, giving us a final fitted line whose intercept and slope are 5.03 and 1.99 respectively, very close to the actual relationship between the two variables.

Tukey’s resistant line can be computed using the R built-in line() function. The maximum number of iterations is set with the iter argument. Usually, five iterations will be enough.

r.fit <- line(x,y, iter = 5)

You can add the resistant line to a base plot as follows:

df <- data.frame(x, y)
plot(y ~ x, df)
abline(r.fit, col = "red")

You can also add the resistant line to a ggplot using the geom_abline() function:

library(ggplot2)
ggplot(df, aes(x,y)) + geom_point() +
  geom_abline(intercept = r.fit$coefficients[1], 
              slope = r.fit$coefficients[2],
              col = "red")

31.2.2 Bisquare

The bisquare fitting strategy employs a robust regression technique that assigns weights to observations based on their residuals relative to the center of the data distribution. Observations closer to the center are given higher weights, while those farther away receive progressively smaller weights, reducing their influence on the model.

The process begins with an initial linear regression model fitted to the data. Residuals from this model are then computed, and a weighting scheme is applied: observations with smaller residuals are assigned larger weights, while those with larger residuals (indicating potential outliers) are assigned smaller weights. A new regression is then performed using these weights, effectively minimizing the influence of outliers on the fitted line.
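A common choice, and the one implemented in the wt.bisquare() function further down, is Tukey’s bisquare (biweight) function. For a scaled residual \(u\) and tuning constant \(c\), the weight is:

\[ w(u) = \begin{cases} \left(1 - (u/c)^2\right)^2 & \text{if } |u/c| < 1 \\ 0 & \text{otherwise} \end{cases} \]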

This iterative process is repeated, with weights recalculated after each regression until the residuals stabilize and no further significant changes are observed. The figure below illustrates three iterations of the bisquare function, showing how weights (displayed as grey text next to each point) evolve. Initially, all weights are set to 1, but they are progressively adjusted in response to the residuals derived from the most recent regression model.

There are different weighting strategies that can be implemented. One such implementation is presented next.

# Create the bisquare function
wt.bisquare <- function(u, c = 6) {
   ifelse( abs(u/c) < 1, (1-(u/c)^2)^2, 0)
}

# Assign an equal weight to all points
wt <- rep(1, length(x))

# Compute the regression, then assign weights based on residual values
for(i in 1:10){
  dat.lm <- lm(y ~ x, weights = wt)
  wt <- wt.bisquare(dat.lm$res / median(abs(dat.lm$res)), c = 6)
}

# Plot the data and the resulting line
plot(x, y, pch = 20)
abline(dat.lm, col = rgb(1,0,0,0.3))

31.2.2.1 Built-in implementation of the bisquare

The MASS package has a robust linear modeling function, rlm, that implements a variation of the bisquare estimation technique described in this chapter. Its results may differ slightly from those presented above, but the differences will usually be negligible.

Note that if you make use of dplyr in a workflow, loading MASS after dplyr will mask dplyr’s select function. This can be problematic. You can either load MASS before dplyr, or call the function via MASS::rlm. The latter option is adopted in the rlm code chunk further below.
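As a quick, minimal illustration of the clash (using the df data frame created earlier in this chapter):

library(dplyr)
library(MASS)          # dplyr's select() is now masked by MASS::select()

# select(df, x)        # this would dispatch to MASS::select() and fail
dplyr::select(df, x)   # the fully qualified call removes any ambiguity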

dat <- data.frame(x, y)
M   <- lm(y ~ x, data = dat)
Mr  <- MASS::rlm(y ~ x, data = dat, psi = "psi.bisquare")
plot(y ~ x, dat, pch = 16, col = rgb(0,0,0,0.2))

# Add the robust line
abline( Mr, col="red")

# Add the default regression line for reference
abline( M , col="grey50", lty=2)

The default regression model is added as a dashed line for reference.

The function rlm can also be called directly from within ggplot.

library(ggplot2)
ggplot(dat, aes(x=x, y=y)) + geom_point() + 
       stat_smooth(method = MASS::rlm, se = FALSE, col = "red",
                   method.args = list(psi = "psi.bisquare"))


31.3 Robust loess

The loess fit can also be sensitive to outliers. The bisquare estimation method can be implemented in the loess smoothing function by passing the "symmetric" option to the family argument.

# Fit a regular loess model
lo <- loess(y ~ x, dat)

# Fit a robust loess model
lor <- loess(y ~ x, dat, family = "symmetric")

# Plot the data
plot(y ~ x, dat, pch = 16, col = rgb(0,0,0,0.2))

# Add the default loess
lines(dat$x, predict(lo), col = "grey50", lty = 2)

# Add the robust loess
lines(dat$x, predict(lor), col = "red")

The robust loess fit is shown in red; the default loess fit is added as a dashed line for reference.

The robust loess can be implemented in ggplot2’s stat_smooth function as follows:

library(ggplot2)
ggplot(dat, aes(x=x, y=y)) + geom_point() + 
          stat_smooth(method = "loess", 
                      method.args = list(family="symmetric"), 
                      se=FALSE,  col="red") 

The loess parameters are passed to the loess function via the method.args argument.