19  Fits and residuals

dplyr tidyr ggplot2 lattice
1.1.4 1.3.1 3.4.4 0.22.5

In the previous chapters, we’ve determined that the voice.part singer groups differed only by location (central value) and not so much by spread. In this section, we will expand this analysis by fitting a model (the mean) to the data, then we’ll explore the residuals (i.e. the part of the data not explained by the fitted model). This exercise will tackle two objectives:

19.1 Fitting the data

Univariate data can be characterized by their location and by their spread. The different groups of singers differ by their central values, we will therefore fit the group means to each group batch and compare the residuals between groups.

First, we’ll load the libraries that will be used in this chapter, then we’ll load the singer data into the df object.

library(dplyr)
library(tidyr)
library(ggplot2)
library(lattice)

df <- singer

Next, we’ll plot the singer values using jittered points. We’ll also add an orange point to each batch which will represent each group’s mean.

ggplot(df, aes(y = height, x = voice.part)) + 
  geom_jitter(width = 0.1, height = 0, alpha = 0.1) +
  stat_summary(fun = "mean", geom = "point", cex = 3, pch = 21, col = "red", bg = "orange") 

We’ve fitted each group with the mean–a mathematical description of the batches. Note that we could have used other measures of location such as the median, but since the data seem to follow a symmetrical distribution, the mean remains an adequate choice.

19.2 Computing the residuals

Now, we’ll subtract the group means from their respective group values: this will give us the residuals for each batch.

# Add residual values to the data
df2 <- df %>% 
  group_by(voice.part) %>%
  mutate(Height.res = height - mean(height))

Next, we will generate a plot of the (jittered) residuals.

# Now plot the data after fitting height with group mean
ggplot(df2) + aes(y=Height.res, x=voice.part)             + 
  geom_jitter(width = 0.1, height=0, alpha=0.1) +
  stat_summary(fun = "mean", geom = "point", cex = 3, pch = 21, col = "red", bg="orange") 

We’ve normalized the batches to a common location. Note that the values along the y-axis have changed: all values are now spread around 0. Next, we’ll check that the batches of residuals have similar spread.

19.3 Comparing the residuals

The feature that interests us in the residuals is the spread. We’ve learned that a good way to compare spreads is to plot the quantiles of each batch against one another.

19.3.1 Residual q-q plots

If we want to compare all batches of residuals, we can create a matrix of pairwise residual q-q plots. We’ll adopt the same code chunk used to generate the pairwise empirical q-q plots in chapter 17.

min_size     <- min(tapply(df2$Height.res, df$voice.part, length)) 
singer_split <- split(df2$Height.res, df$voice.part)
rng          <- range(df2$Height.res) 
 
qq_df <- as.data.frame(lapply(singer_split, 
                              function(x) quantile(x, type = 5,
                                                   p = (1:min_size -0.5)/min_size) ))

plotfun = function(x,y, ...){
points(x,y,pch=18)
  abline(c(0,1),col="blue")
}

pairs(qq_df, upper.panel=NULL, panel = plotfun, xlim = rng, ylim=rng) 

Since we removed the means from each batch of values, each pair of values should no longer display any significant offsets. This facilitates our comparison of the spreads and allows us to focus just on the multiplicative offsets.

The residual q-q plots suggest that the spreads are very similar across singer heights given that the points fall almost perfectly along the one-to-one line.

19.3.2 Comparing batches to pooled residuals using a q-q plot

Since the spreads are homogeneous across the batches, we may choose to combine (pool) the residuals and compare the residuals of each batch to the pooled residuals. The advantage with this approach is that we are increasing the size of the reference residual distribution thus reducing noise that results from a relatively small sample size. It also reduces the number of q-q plots to analyze–even gone from 28 plots to just eight!

df3 <- df2 %>%
  group_by(voice.part)  %>%
  arrange(Height.res)  %>% 
  mutate(f.val    = (row_number() - 0.5) / n())  %>%
  ungroup()  %>%
  mutate(Pooled.res = quantile(Height.res, probs = f.val))  %>%
  select(voice.part, Height.res, Pooled.res)

ggplot(df3, aes(y = Height.res, x = Pooled.res)) + geom_point(alpha = 0.5) + 
              geom_abline(intercept = 0, slope = 1) +
              facet_wrap(~ voice.part, nrow = 1) 

All eight batches seem to have similar spreads. This makes it possible to compare batch means using a residual-fit spread plot (covered later in this chapter).

19.3.2.1 What to expect if one or more of the batches have different spreads

The residual vs. pooled residual plots can be effective at identifying batches with different spreads. In the following example, we combine four simulated batches generated from an identical distribution (V1, V2, V3 and V4) with two simulated batches generated from a different distribution (V5 and V6). Their boxplots are shown next.

Now let’s take a look at the residual vs. pooled residual plots.

Batches V5 and V6 clearly stand out as having different distributions from the rest of the batches. But it’s also important to note that V5 and V6 contaminate the pooled residuals. This has the effect of nudging the other four batches away from the one-to-one line. Note what happens when batches V5 and V6 are removed from the pooled residuals.

The tightness of points around the one-to-one line suggests nearly identical distributions between V1, V2, V3 and V4 as would be expected given that they were generated from the same underlying distribution.

Performing simulations like this can help understand how a pooled residual q-q plot may behave under different sets of distributions.

19.4 Residual-fit spread plot

So far, we’ve learned that the spreads of singer heights are the same across all batches. This makes it feasible to assess whether the differences in mean heights between voice parts are comparable in magnitude to the spread of the pooled residuals.

19.4.1 A simple example

First, let’s compare the following two plots. Both plots show two batches side-by-side. The difference in location is nearly the same in both plots (group a and b have a mean of 10 and 11 respectively), but the difference in spreads are not.

Plot 2 does not allow us to say, with confidence, that the two batches differ significantly despite both means being different. Plot 1 on the other hand, shows a significant difference in batch locations. One cannot make inferences about differences in central values without knowing the batches’ distributions.

For example, in Plot 1, the spread (or difference) in mean values is relatively large compared to the spread of the residuals for each group (note that the spreads are nearly identical between both batches a and b). The difference in means spans one unit while the spread of each sets of residuals spans about the same amount. So the difference in location is significant and is very likely not due to chance alone. The same cannot be said for Plot 2.

If we split each batch in Plot 1 into a location component plot (normalized to the overall mean) and a pooled residual component plot, and then compare those values against a quantile, we get a residual-fit spread plot, or r-f spread plot for short.

It’s clear from this r-f spread plot that the spread of the mean distribution (between batches a and b) is important compared to that of its residuals. This suggests that the groups a and b explain much of the variability in the data.

For Plot 2, the difference in mean values is also one unit, but the spread of residuals spans almost 5 units. An r-f spread plot makes this difference quite clear.

The spread between each batch’s fitted mean is small compared to that of the combined residuals suggesting that much of the variability in the data is not explained by the differences between groups a and b for Plot 2.

19.4.2 Are the fitted voice part values significantly different?

To generate the r-f plot, we first need to normalize the data to the global mean. We then split the normalized singer height data into two parts: the modeled means and the residuals. For example, the smallest value in the Bass 2 group is 66. When normalized to the global mean, that value is -1.29. The normalized value is then split between the group (normalized) mean of 4.1 and its residual of -5.39 (i.e. the difference between its value and the Bass 2 group mean). These two values are then each added to two separate plots: the fitted values plot and the residuals plot. This process is repeated for each observation in the dataset to generate the final r-f spread plot.

To generate the R-F plot using ggplot2, we must first split the data into its fitted and residual components. We’ll make use of piping operations to complete this task.

df4 <- singer %>%
  mutate(norm = height - mean(height)) %>%   # Normalize values to global mean
  group_by(voice.part) %>% 
  mutate( Residuals  = norm - mean(norm),    # Extract group residuals
          `Fitted values` = mean(norm))%>%   # Extract group means
  ungroup() %>% 
  select(Residuals, `Fitted values`) %>% 
  pivot_longer(names_to = "type",  values_to = "value", cols=everything()) %>% 
  group_by(type) %>% 
  arrange(value) %>% 
  mutate(fval = (row_number() - 0.5) / n()) 

Next, we’ll plot the data.

ggplot(df4, aes(x = fval, y = value)) + 
  geom_point(alpha = 0.3, cex = 1.5) +
  facet_wrap(~ type) +
  xlab("f-value") +
  ylab("Height (inches)") 

An alternative to the side-by-side r-f plot is one where both fits and residuals are overlapping.

ggplot(df4, aes(x = fval, y = value, col = type)) + 
  geom_point(alpha = 0.3, cex = 1.5) +
  xlab("f-value") +
  ylab("Height (inches)") 

The spread of the fitted heights (across each voice part) is not insignificant compared to the spread of the combined residuals. The spread in the fitted values (aka the means) encompasses about 90% of the spread in the residuals (you can eyeball the percentage by matching the upper and lower mean values with the residuals’ f-values). So height differences between singer groups cannot be explained by random chance alone or, put another way, the voice-parts can explain a good part of the variation in the data!

19.5 Comparing pooled residuals to the normal distribution

Our exploration of the singer height batches have been visual thus far. But there may be times when the analysis may need to culminate in a statistical test. Some of these tests reduce the data to mathematically tractable models such as the mean and the standard deviation.

We’ll take advantage of the pooled residuals to give us a larger sample size for comparison with the theoretical normal distribution.

# Find the equation for the line
ggplot(df3, aes(sample = Pooled.res)) + stat_qq(distribution = qnorm) + 
  geom_qq_line(distribution = qnorm)

This dataset has behaved quite well. Its batches differed only by location, yet its spread remained homogeneous (enough) across the batches to pool them and enable us to confirm, with greater confidence, that the spread follows a normal distribution.

This well behaved dataset allows us to model its spread using the sample standard deviation. It’s important to note that had the data not followed a normal distribution, then characterizing its spread using the standard deviation would have been inappropriate. Unfortunately, many ancillary data analysts seldom check the distribution requirements of their data before choosing to characterize its distribution using the standard deviation. In such a situation, you would have to revert to a far less succinct characterization of spread: the quantile.

You can compute the standard deviation as:

sd(df2$Height.res)
[1] 2.465049

We can now model singer height by both voice.part means, and the group standard deviation of 2.47.