What follows is an example of an exploratory analysis of the Kennebec River (Maine, USA) discharge rates over time. This analysis is conducted in the spirit of John Tukey in that it leverages both re-expression and robustness in teasing out trends in the data.
The data were pulled from the USGS. The gage station site id is
01049500
and is located in Gardiner , Maine (USA). The data span the years 1976 through 2023.
library(dplyr)
library(lubridate)
library(ggplot2)
dat <- readRDS("Kennebec.Rds")
An initial plot of the discharge rate (in ft3/s) follows.
ggplot(dat, aes(Date, Discharge)) + geom_line() +
ylab("Discharge (ft3/s)")
The data exhibit a strong skew. We can check its distribution using a boxplot.
ggplot(dat, aes(y = "", x = Discharge)) + geom_boxplot() + ylab("") + xlab("Discharge (ft3/s)")
A skewed data can bias the results of the analysis in favor of a small percentage of the observations while missing out on what the bulk of the data may be saying about the underlying process. A re-expression of the data is therefore strongly recommended.
A re-expression using a power of 1/5
seems to do a nice
job in rendering a symmetrical distribution.
dat$disp02 <- dat$Discharge^(1/5)
ggplot(dat, aes(y = "", x = disp02)) + geom_boxplot() +
ylab("") + xlab("Discharge (ft3/s)^0.2")
Note that another way to plot the distribution is to keep the original values and scale the grid.
ggplot(dat, aes(y = "", x = Discharge)) + geom_boxplot() + ylab("") + coord_trans(x = scales::boxcox_trans(1/5)) +
xlab("Discharge (ft3/s)")
Here’s a plot of the re-expressed discharge over time.
ggplot(dat, aes(Date, disp02)) + geom_line() +
ylab("Discharge (ft3/s)^0.2")
Given that an overall trend has a variability that seems much smaller than the total variability in the data, we’ll fit a loess to help guide our choice of a parametric model. We’ll plot both a traditional loess and a robust loess.
ggplot(dat, aes(Date, disp02)) + geom_line() +
ylab("Discharge (ft3/s)^0.2") +
geom_smooth(method = "loess", se = FALSE)+
geom_smooth(method = "loess", se = FALSE, col = "red",
method.args = list(family = "symmetric"))
There is no significant difference between the traditional loess (blue line) and the robust loess (red line) which suggest that any outlier in the data does not seem to wield a disproportionate influence on the model fit.
The smooths’ shape is difficult to make out given the range of y axis values. We’ll re-plot the data using a smaller y-axis range.
ggplot(dat, aes(Date, disp02)) + geom_line(alpha = 0.2) +
ylab("Discharge (ft3/s)^0.2") +
geom_smooth(method = "loess", se = FALSE) +
coord_cartesian(ylim = c(2.75, 3.25))
Another approach to fitting a smooth to a noisy dataset is to aggregate the data across the range of x values. In this example, we summarize the discharge values by year. We’ll use the median as the summary statistic.
dat.sub <- dat %>%
mutate(year.int = as.integer(year(Date))) %>%
group_by(year.int) %>%
summarise(med = median(disp02, na.rm = TRUE))
ggplot(dat.sub, aes(year.int, med)) + geom_point() +
geom_smooth(method = "loess", se = FALSE, col = "blue") +
xlab("Year")
In both cases, the patterns appear similar with a relatively linear discharge rate between the years 1975 and 1998 and a parabolic shape between 1998 and 2023.
There are at least two approaches that can be taken in modeling our
observed trend when using a parametric mode: split the data into two
parts and model each trend separately (i.e. pre and post 1998), or fit
an overall model. We’ll adopt the latter. This will help us quantify an
“overall” change in discharge even though the change may not be
monotonic across all years using. We’ll fit a 1st order
polynomial model using both a regular lm
model and the
MASS
package’s robust rlm
model.
ggplot(dat, aes(Date, disp02)) + geom_line(alpha = 0.2) +
ylab("Discharge (ft3/s)^0.2") +
geom_smooth(method = "lm", se = FALSE) +
geom_smooth(method = MASS::rlm, se = FALSE, col = "red") +
coord_cartesian(ylim = c(2.5, 3.5))
The slopes between both the regular and robust regression models are different. This suggests that outliers in the data may be wielding a disproportionate influence on the data (this was not the case with the loess fit). We’ll adopt the robust model’s output in this analysis
M1 <- MASS::rlm(disp02 ~ decimal_date(Date), dat)
The modeled line is of the form \(Discharge^{1/5} = a + b(Year)\) with \(b\) = 0.0047 representing the average
increase in discharge per year when expressed as a power of
0.2
. The overall trend (across all 44 years) explains about
0.20 (ft3/s)(0.2) of the variability in the data,
or, when expressing the value back to the linear scale, 0.205
= 0.0003 ft3/s –a very small fraction of the overall
variability in the data.
To properly explore the residuals for seasonal periodicity, we need to remove any multi-year patterns in the data. This is best performed using a loess. After exploring different loess spans, a span width of 0.3 was chosen. This is a smaller span than that used earlier when we modeled the overall trend (recall that we are seeking to remove multi-year patterns in this phase of the analysis).
ggplot(dat, aes(Date, disp02)) + geom_line(alpha = 0.2) +
ylab("Discharge (ft3/s)^0.2") +
geom_smooth(method = "loess", se = FALSE, span = 0.3) +
coord_cartesian(ylim = c(2.5, 3.5))
L1 <- loess(disp02 ~ decimal_date(Date), dat, span = 0.3)
dat$res1 <- L1$residuals
The residual plot follows:
ggplot(dat, aes(Date, res1)) + geom_line() +
ylab("Residuals")
The following plot groups the data by month of the year.
dat$month <- month(dat$Date, label = TRUE)
ggplot(dat, aes(month, res1)) + geom_boxplot() +
ylab("Residuals")
A seasonal component can be clearly discerned from the above boxplots. From the plot, we can deduce that the seasonal component explains about 2.5 (ft3/s)(0.2) (or 2.55 ~ 100 ft3/s) of the variability in the data–a much bigger spread than that computed from the overall trend.
ggplot(dat, aes(Date, res1)) + geom_point(alpha = 0.1, cex = 0.5) +
facet_wrap(~ month, nrow = 1) +
geom_smooth(method = "lm", se = FALSE,
method.args = list(family = "symmetric")) +
scale_x_continuous(breaks = c(as.Date("1976-1-1"), as.Date("2023-1-1")), label = c("1977", "2023")) +
theme(axis.text.x.bottom = element_text(angle = -45, size = 7, hjust = -0.3))
There seems to be an uptick in the rates of discharge for the winter months and a decrease in discharge for the spring months. This suggests an increase in amplitude for the winter/spring months, or an offset in amplitude.
Next, we’ll create a new variable, day
, that will give
us the day of the year. This will allow us to construct a three
dimensional dataset with day and year mapped to the x and y axes
respectively, and the residuals mapped to the z axis. The latter will be
visualized using either contours or filled tiles (or both).
# heat map of residual by day
dat.tile <- dat %>%
group_by(year = as.integer(year(Date)), day = yday(Date)) %>%
summarise(res1m = median(res1))
ggplot(dat.tile, aes(x = day, y = year, fill = res1m)) +
geom_tile() +
# scale_fill_gradient2(low="green", mid = "yellow", high = "red",
# midpoint = 0)
scico::scale_fill_scico(palette = "vik", direction = -1, name = "Residual Discharge")
Given that we have quite a bit of noise in the data that may be
suppressing the seasonal pattern, it may be best to apply a smooth to
the data. Here, we’ll adopt a loess smooth. A loess span of
0.2
is adopted to help capture small (time) scale
variations on the data. To avoid edge effects at the beginning and end
of each year, the data are wrapped. But wrapping is not applied to the
ends of the date ranges (i.e. prior to 1976 and after 2023).
dat.tile.pre <- dat.tile %>%
filter(day > 250) %>%
mutate(year = year + 1,
day = day - 366)
dat.tile.post <- dat.tile %>%
filter(day < 100) %>%
mutate(year = year - 1,
day = day + 366)
dat.tile.wrap <- rbind(dat.tile, dat.tile.pre, dat.tile.post)
l3d <- loess(res1m ~ year + day, dat.tile.wrap, span = 0.2,
family = "symmetric", degree = 2)
smooth.df <- data.frame(l3d$x, l3d$fitted) %>%
filter(day > 0 & day < 366)
#dat.tile$loess <- l3d$fitted # Note that we are extracting the loess fitted model
# and not the loess residuals. But recall that the
# data being modeled are the discharge residuals
ggplot(smooth.df) + aes( day , year,
fill = l3d.fitted,
z = l3d.fitted) +
geom_raster(interpolate = TRUE) +
scico::scale_fill_scico(palette = "vik", direction = -1, name = "Residual Discharge") +
geom_contour(size = 0.2, color = "#EE9A00",
breaks = seq(-0.5, 0.5, by = 0.1)) +
geom_vline(xintercept = c(80, 172, 264), col = "grey90", lty=2 ) +
scale_x_continuous(breaks = c(80, 172, 264),
labels = c("Mar 21", "Jun 21", "Sep 21"),
name = NULL)
The smoothed surface has helped us identify a shift in spring river discharge towards earlier months (though a lull in spring discharge is observed for the time period between 2005 and 2015).
The overall (45 year) trend in discharge rates is a very small contributor to the observed variability in the data. The seasonal component explains about 100 ft3/s of the overall variability. The largest contributor to variability appears to be random fluctuations in rates.
We’ve also observed a seasonal shift towards winter months in winter/spring discharge rates. This shift is not observed for the summer and early fall months.
lm
suggests an
increase of 2.2 ft3/s per year–a much greater increase than
what was observed with the re-expressed version of the data. This trend
is probably more reflective of the trends in extreme discharges in the
upper range of values and not the bulk of discharge values.
ggplot(dat, aes(Date, Discharge)) + geom_line(alpha = 0.2) +
ylab("Discharge (ft3/s)") +
geom_smooth(method = "lm", se = FALSE) +
geom_smooth(method = MASS::rlm, se = FALSE, col = "red") +
coord_cartesian(ylim = c(0,500))