27  Slicing data

dplyr ggplot2 tidyr
1.1.4 3.4.4 1.3.1

So far, we’ve assumed that the underlying process that relates the \(x\) and \(y\) variables is homogeneous across the full range of x-values. This assumption simplifies model fitting strategies. But sometimes, that assumption does not hold as we will see next.

27.1 Introduction

Let’s start off by downloading a dataset.

df <- read.csv("http://mgimond.github.io/ES218/Data/Sample1.csv")

Next, we’ll plot the data and fit a straight line.

library(ggplot2)

ggplot(df, aes(x = x,y = y)) + geom_point(alpha = 0.3) +
  stat_smooth(method = "lm", se = FALSE)

The line seems to do a decent job in depicting the overall trend but the relationship does not appear perfectly linear. Let’s check the residuals via a residual-dependence plot.

M1    <- lm( y ~ x, df)
df$residuals <- residuals(M1)

ggplot(df, aes(x = x, y = residuals)) + geom_point(alpha = 0.3) +
  stat_smooth(method = "loess", se = FALSE, span = 0.2, 
              method.args = list(degree = 1))

There appears to be a dip in residual values between ~95 and ~107 followed by an increase beyond ~107. The kinks in the residuals seem to delineate three perfectly straight line segment suggesting that the raw data may be modeled using three distinct lines (i.e. with differing slopes and intercepts). Note the use of a small loess span in the above code to reveal the kinks in the pattern.

Sometimes, the data may represent outcomes from different processes given different ranges in independent (x) values. Since the residual plot seems to suggest that the kinks occur are x=95 and x=106, we’ll split the data into three groups: less than 95, between 95 and 106 and greater than 106. These groups will be labeled 1, 2, and 3 and will be assigned to a new column called grp. This new column will then be used to facet the scatter plots and associated loess curve. We’ll use of the case_when() function to perform this task.

library(dplyr)
df2 <- df %>% mutate(grp = case_when(x < 95 ~ 1,
                                     x < 106 ~ 2,
                                     x >= 106 ~ 3))

ggplot(df2, aes(x = x,y = y)) + geom_point(size = 0.5, alpha = 0.3) +
  stat_smooth(method = "loess", se = FALSE) + facet_grid(. ~ grp)

The segmented plots seem to confirm our earlier suspicion that the data followed three separate linear processes. We can extract the slope and intercept for each segment using the following chunk of code:

library(tidyr)
df2  %>% 
  group_by(grp) %>% 
  do( M1 = (lm(y ~ x, . ) ) )  %>% 
  mutate(intercept = coef(M1)[1],
         slope = coef(M1)[2]) %>% 
  select(-M1)
# A tibble: 3 × 3
# Rowwise: 
    grp intercept slope
  <dbl>     <dbl> <dbl>
1     1      1.41  2.09
2     2    103.    1.04
3     3    -53.4   2.50

So, three separate models were needed to capture, what we believed to be, three different underlying processes.

Exploring the residuals using the loess fit was pivotal in identifying the breaks in the pattern. But, as we will learn in the next section, these breaks may not always be so obvious in a dataset.

27.3 Reference

Original paper: Vincent, L. A., et al., 2005. Observed trends in indices of daily temperature extremes in South America 1960–2000. J. Climate, 18, 5011–5023.

Comment to the paper Stone, R. J., 2011. Comments on “Observed trends in indices of daily temperature extremes in South America 1960–2000.” J. Climate, 24, 2880–2883.

The reply to the comment Vincent, L. A., et al., 2011. Reply, J. Climate, 24, 2884-2887.