18  The theoretical q-q plot

dplyr ggplot2 lattice tukeyedar
1.1.4 3.4.4 0.22.5 0.2.3

18.1 Introduction

Thus far, we have used the quantile-quantile plots to compare the distributions between two empirical (i.e. observational) datasets–hence the name empirical q-q plot. We can also use the q-q plot to compare an empirical distribution to a theoretical distribution (i.e. one defined mathematically). Such a plot is usually referred to as a theoretical Q-Q plot. Examples of popular theoretical distribution are the normal distribution (aka the Gaussian distribution), the chi-square distribution, and the exponential distribution just to name a few.

There are many reasons we might want to compare empirical data to theoretical distributions:

  • A theoretical distribution is easy to parameterize. For example, if the shape of the distribution of a batch of numbers can be approximated by a normal distribution we can reduce the complexity of our data to just two values: the mean and the standard deviation.

  • If data can be approximated by certain theoretical distributions, then many mainstream statistical procedures can be applied to the data.

  • In inferential statistics, knowing that a sample was derived from a population whose distribution follows a theoretical distribution allows us to derive certain properties of the population from the sample. For example, if we know that a sample comes from a normally distributed population, we can define confidence intervals for the sample mean using a t-distribution.

  • Modeling the distribution of the observed data can provide insight into the underlying process that generated the data.

But very few empirical datasets follow any theoretical distributions exactly. So the questions usually ends up being “how well does theoretical distribution X fit my data?”

The theoretical quantile-quantile plot is a tool to explore how a batch of numbers deviates from a theoretical distribution and to visually assess whether the difference is significant for the purpose of the analysis. In the following examples, we will compare empirical data to the normal distribution using the normal quantile-quantile plot or normal q-q plot for short.

18.2 The normal q-q plot

If we wish to assess whether a batch of values follows the shape of a normal distribution, we can compare it to its matching normal quantile function,\(q_{\mu,\sigma}(f)\). For example, given a batch of values \(x\), we can generate the matching normal quantile using \(x\)’s mean, \(\mu\) and standard deviation, \(\sigma\). The following figure shows the overlapping density plots (left plot) and q-q plot (right plot).

The traditional “bell” shape of the normal distribution (red density plot) is apparent in the left plot. You will seldomly expect any set of observational data to follow a normal distribution exactly. Our interest is in knowing how close a batch of values comes to a normal distribution.

Constructing the above q-q plot requires that we extract the mean and standard deviation from \(x\). This can be avoided by noting that the normal quantile, \(q_{\mu,\sigma}(f)\), can be decomposed into its mean an standard deviation components \(\mu + \sigma q_{0,1}(f)\) where \(q_{0,1}(f)\) is the unit normal quantile (i.e. a normal quantile whose mean is \(0\) and standard deviation is \(1\) unit). So, if we compare \(x\) to a unit normal quantile we get the following q-q plot:

You’ll note both the additive and multiplicative offsets. The additive offset is nothing more that the sample mean, \(\mu\), and the multiplicative offset is nothing more than the sample standard deviation, \(\sigma\).

We know that if two batches differ only by their offsets, and not their overall shape, then we expect the points to follow a straight line. Knowing this, we can modify the q-q plot by plotting \(x\) against the unit normal quantile, \(q_{0,1}(f)\). The \(x=y\) slope used in the empirical q-q plot no longer applies here given that the \(x\) and \(q_{0,1}(f)\) scales will not necessarily match (the \(q_{0,1}(f)\) will usually range from -2 to 2). Instead, we fit a line to the points to help gauge the pattern’s straightness. This gives us the unit normal q-q plot. Note that the word “unit” is often dropped from the plot name and is therefore often labeled as the normal q-q plot.

The x-axis can help identify the tails of the distribution by noting that roughly 68% of the values fall between -1 and 1 standard deviations and that roughly 95% of the values fall between -2 and 2 standard deviations.

We’ll first learn how to generate this plot using the built-in R function, then we’ll do the same with the ggplot2 package, and then using the tukeyedar package.

18.2.1 Using R’s built-in functions

In the following example, we’ll compare the Alto 1 group to a normal distribution.

library(dplyr)

df <- lattice::singer

alto  <- df %>% 
  filter(voice.part == "Alto 1") %>% 
  pull(height)

Note that alto is a single vector element. We’ll use two built-in functions to generate a normal q-q plot: qqnorm and qqline.

qqnorm(alto)
qqline(alto)

There are many ways one can fit a line to the data, Here, we opt to fit a line to the first and third quartiles (IQR) of the q-q plot. Note that the qqline function is defaulting to an f-value type of 7. This can be changed via the qtype argument. But note that in practice, the choice of quantile type will have little importance in discerning the straightness of the point pattern.

18.2.2 Using the ggplot2 plotting environment

To generate the theoretical q-q plot in ggplot, we first use the stat_qq function to generate the point plot, then we call the stat_qq_line function to generate the IQR fit. Here, we are passing a single vector object instead of a dataframe to ggplot().

library(ggplot2)

ggplot() + aes(sample = alto) + stat_qq(distribution = qnorm) +
  stat_qq_line(col = "blue") +
  xlab("Unit normal quantile") + ylab("Height")

Note the slight difference in syntax used with ggplot when passing a vector instead of a dataframe to the function. Here, we take the aes() function outside of the ggplot() function. This is done to render a cleaner syntax. The alternative, ggplot(,aes(sample = alto)), would make it difficult to notice the comma just before aes(), thus increasing the chance for a typo.

The stat_qq_line function uses the built-in quantile function and, as such, will adopt the default quantile type 7 (i.e. it computes the f-value as \((i - 1)/(n - 1))\). This setting cannot be changed in stat_qq_line.

Note that geom_qq and geom_qq_line functions are identical to stat_qqand stat_qq_line.

18.2.3 Using the custom eda_qq function

You were introduced to the eda_qq custom function in the previous chapter. This function can also be used to generate normal q-q plots by setting norm to TRUE. In such a case, the function takes as input a single vector of values.

library(tukeyedar)
eda_qq(alto, norm=TRUE) 

The function defaults to the quantile type 5. To adopt the default quantile type used in ggplot2, set q.type = 7.

Note that when the eda_qq function is used to generate a normal q-q plot, the light dashed lines highlight the standard deviation for both sets of values. This differs from the mid 75% representation of values adopted by the function when generating an empirical q-q plot.

18.3 How normal is my dataset?

The alto batch of values seem to do a good job in following a normal distribution given how well they follow a straight line. The stair-step pattern in the points is simply a byproduct of the rounding of height values to the nearest inch. A few observations at the tail ends of the distribution deviate from normal, but this is to be expected given that tail ends of distributions tend to be noisy.

So, how do the other singer groups compare to a normal distribution? We’ll make use of ggplot’s faceting function to generate all eight normal q-q plots.

ggplot(df, aes(sample=height)) + stat_qq(distribution=qnorm) +
  stat_qq_line( col = "blue") +
  xlab("Unit normal quantile") + ylab("Height") +
  facet_wrap(~voice.part, nrow = 1)

When comparing a batch of values to a normal distribution, we are looking for a point pattern that follows the fitted line at the core of the dataset (i.e. between -1 and 1 standard deviations). If a systematic deviation from the straight line is observed (such as a curved pattern, for example), then this may be evidence against an assumption of normality. For the most part, all eight batches in our example appear to follow a normal distribution.

18.4 What would a dataset pulled from a normal distribution look like?

Simulations are a great way to develop an intuitive feel for what a dataset pulled from a normal distribution might look like in a normal q-q plot. You will seldom come across perfectly normal data in the real world. Noise is an inherent part of any underlying process. As such, random noise can influence the shape of a q-q plot despite the data coming from a normal distribution. This is especially true with small datasets as demonstrated in the following example where we simulate five small batches of values pulled from a normal distribution. The rnorm function is used in this example to randomly pick a number from a normal distribution whose mean is set to the mean of the alto values and whose standard deviation is set to the standard deviation of the alto values. We also round the values to mimic the rounding of height values observed in the singer dataset.

set.seed(321)  # Sets random generator seed for consistent output

# Simulate values from a normal distribution
sim <- data.frame(sample = paste0("Sample",1:5),
                  value  = round(rnorm(length(alto)*5, 
                                 mean = mean(alto), sd = sd(alto))))

# Generate q-q plots of the simulated values
ggplot(sim, aes(sample = value)) + stat_qq(distribution = qnorm) +
  stat_qq_line(line.p = c(0.25, 0.75), col = "blue") +
  xlab("Unit normal quantile") + ylab("Simulated normals") +
  facet_wrap(~ sample, nrow = 1) 

Of the five simulated batches, Sample3 generates a textbook normal q-q plot that one would expect from a normally distributed batch of values. Sample2 could lead one to question whether the data were pulled from a normal distribution, even though we know that they were!

The singer height normal q-q plots do not look different from some of these simulated plots. In fact, they probably look more Normal then the simulated set of values! This lends confidence in our earlier verdict that the singer height distributions can be characterized by a normal distribution.

18.5 How normal q-q plots behave in the face of skewed data

It can be helpful to simulate distributions of difference skewness to see how a normal quantile plot may behave. In the following figure, the top row shows different density distribution plots; the bottom row shows the normal q-q plots for each distribution.