The eda_sl function generates William Cleveland's
spread-location plot for univariate and bivariate data. The function can also
generate Tukey's spread-level plot.
Usage
eda_sl(
  dat,
  x = NULL,
  fac = NULL,
  type = "location",
  p = 1,
  tukey = FALSE,
  base = exp(1),
  sprd = "frth",
  jitter = 0.01,
  robust = TRUE,
  loess.d = list(family = "symmetric", degree = 1, span = 1),
  label = TRUE,
  label.col = "lightsalmon",
  xlab = NULL,
  ylab = NULL,
  labelxbuff = 0.05,
  labelybuff = 0.05,
  show.par = FALSE,
  plot = TRUE,
  ...
)
Arguments
- dat
Dataframe of univariate data or a linear model.
- x
Continuous variable column (ignored if dat is a linear model).
- fac
Categorical variable column (ignored if dat is a linear model).
- type
s-l plot type. "location" = spread-location, "level" = spread-level (only for univariate data), "dependence" = spread-dependence (only for bivariate model input).
- p
Power transformation to apply to variable. Ignored if input is a linear model.
- tukey
Logical; Determines if a Tukey transformation should be adopted (FALSE adopts a Box-Cox transformation).
- base
Base used with the log() function if px or py is 0.
- sprd
Choice of spreads used in the spread-versus-level plot (i.e. when type = "level"). Either interquartile, sprd = "IQR", or fourth-spread, sprd = "frth" (default).
- jitter
Jittering parameter for the spread-location plot. A fraction of the range of location values.
- robust
Logical; Indicates if robust regression should be used on the spread-level plot.
- loess.d
Arguments passed to the internal loess function. Applies only to the bivariate model s-l plots and the spread-level plot.
- label
Logical; Determines if group labels are to be added to the spread-location plot.
- label.col
Color assigned to group labels (only applicable if type = "location").
- xlab
X label for output plot.
- ylab
Y label for output plot.
- labelxbuff
Buffer to add to the edges of the plot to make room for the labels in a spread-location plot. Value is a fraction of the plot width.
- labelybuff
Buffer to add to the top of the plot to make room for the labels in a spread-location plot. Value is a fraction of the plot width.
- show.par
Logical; Determines if the power transformation applied to the data should be displayed.
- plot
Logical; Determines if plot should be generated.
- ...
Arguments passed on to .eda_plot_xy
y
A numeric vector or column name in dat for the y-axis.
px
Power transformation used in the input data to display if show.par = TRUE.
py
Power transformation used in the input data to display if show.par = TRUE.
raw_tick
Logical. If TRUE, original (untransformed) equally spaced tick values are displayed on the re-expressed axes.
xlim
X-axis range.
ylim
Y-axis range.
reg
Logical; whether to fit and display a regression line.
poly
Integer; regression model polynomial degree (defaults to 1 for linear model).
rlm.d
List; parameters for MASS::rlm (e.g., list(psi = "psi.bisquare")).
w
Optional numeric vector of weights for regression.
lm.col
Regression line color.
lm.lw
Numeric; Regression line width.
lm.lty
Numeric; Regression line type.
sd
Logical; whether to show ±1 SD lines.
mean.l
Logical; whether to show x and y mean reference lines.
asp
Logical; whether to preserve the aspect ratio (ignored if square = FALSE).
square
Logical; whether to create a square plotting window.
grey
Numeric between 0 and 1; controls grayscale background elements (0 = black, 1 = white).
pch
Integer; point symbol.
p.col
Point border color.
p.fill
Point fill color.
size
Point size.
alpha
Point transparency level (0 = 100% transparent, 1 = 100% opaque).
q
Logical; whether to draw inner quantile boxes (quantile shading).
q.type
Integer; type of quantile calculation (see quantile).
inner
Numeric; defines the inner fraction of values to highlight with quantile shading.
qcol
Fill color of quantile shading.
loe
Logical; whether to plot loess smooth line.
loe.lw
Numeric; Loess smooth line width.
loe.col
Loess smooth color.
loe.lty
Numeric; Loess smooth line type.
stats
Logical; if TRUE, displays model statistics (R², β, p-value).
stat.size
Text size for stats plot display.
hline
Numeric; location(s) of additional horizontal reference lines. Can be passed via the c() function.
vline
Numeric; location(s) of additional vertical reference lines. Can be passed via the c() function.
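The p, tukey and base arguments select a re-expression of the data. As a rough illustration, the sketch below writes out the textbook forms of the two power-transformation families. The helper names tukey_power and boxcox_power are hypothetical and are not part of the package, whose internal implementation may differ (for example in how negative powers are signed).

```r
# Hypothetical helpers showing the textbook definitions of the two families.
# They are illustrations only, not the package's internal code.

# Tukey power transformation: x^p, with log(x) standing in for p = 0
tukey_power <- function(x, p, base = exp(1)) {
  if (p == 0) log(x, base) else x^p
}

# Box-Cox power transformation: (x^p - 1) / p, with log(x) for p = 0
boxcox_power <- function(x, p, base = exp(1)) {
  if (p == 0) log(x, base) else (x^p - 1) / p
}

x <- c(1, 2, 4, 8)

tukey_power(x, -2)   # a plain negative power reverses the ordering of x
boxcox_power(x, -2)  # the Box-Cox form keeps the original ordering
```

One practical difference is visible for negative powers: the plain power reverses the order of the values, whereas the Box-Cox form preserves it, which keeps the direction of the re-expressed axis consistent with the raw data.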
Details
The function generates a few variations of the spread-location/spread-level
plots depending on the data input type and the value passed to the type
argument. The residual spreads are mapped to the y-axis and the
levels are mapped to the x-axis. Their values are computed as follows:
- type = "location" (univariate data):
William Cleveland's spread-location plot applied to univariate data.
\(spread = \sqrt{|residuals|}\)
\(location = medians\)
- type = "level" (univariate data):
Tukey's spread-level plot (aka spread-versus-level plot, Hoaglin et al., p 260). If the pattern is close to linear, the plot can help find a power transformation that stabilizes the spread in the data: the suggested power is one minus the fitted slope. This option outputs the intercept and slope of the fitted line in the console. A loess is added to assess linearity. By default, the fourth spread is used to define the spread. Alternatively, the IQR can be used by setting sprd = "IQR". The output will be nearly identical except for small datasets, where the two methods may diverge slightly.
\(spread = \log(fourth\ spread(residuals))\)
\(location = \log(medians)\)
- type = "location" if input is a model of class lm, eda_lm or eda_rline:
William Cleveland's spread-location plot (aka scale-location plot) applied to residuals of a linear model.
\(spread = \sqrt{|residuals|}\)
\(location = fitted\ values\)
- type = "dependence" if input is a model of class lm, eda_lm or eda_rline:
William Cleveland's spread-location plot applied to residuals of a linear model, with the spread plotted against the x variable instead of the fitted values.
\(spread = \sqrt{|residuals|}\)
\(dependence = x\ variable\)
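The univariate spread-location quantities defined above can be reproduced by hand with base R. The following is a minimal sketch, assuming residuals are taken about each group's median; it uses the built-in mtcars data (mpg grouped by cyl) purely for illustration, and the object names are hypothetical. It is not the package's internal code.

```r
# Grouped univariate data: a continuous variable and a categorical variable
dat <- data.frame(y = mtcars$mpg, g = factor(mtcars$cyl))

# Location: the median of the group each observation belongs to
loc <- ave(dat$y, dat$g, FUN = median)

# Residuals about the group median, then spread = sqrt(|residuals|)
res  <- dat$y - loc
sprd <- sqrt(abs(res))

sl <- data.frame(group = dat$g, location = loc, spread = sprd)

# Median spread per group: a trend in these values across locations
# suggests that spread depends on location
aggregate(spread ~ group + location, data = sl, FUN = median)
```

Plotting sl$spread against sl$location gives the raw ingredients of a spread-location plot; jittering and group labels are omitted here.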
References
David C. Hoaglin, Frederick Mosteller, and John W. Tukey. Understanding Robust and Exploratory Data Analysis. Wiley (1983).
William S. Cleveland. Visualizing Data. Hobart Press (1993).
Examples
cars <- MASS::Cars93
# Cleveland's spread-location plot applied to univariate data
eda_sl(cars, MPG.city, Type)
# You can specify the exact form of the spread on the y-axis
# via the ylab argument
eda_sl(cars, MPG.city, Type, ylab = expression(sqrt(abs(residuals))) )
# The function can also generate Tukey's spread-level plot to identify a
# power transformation that can stabilize spread across fitted values
# following power = 1 - slope
eda_sl(cars, MPG.city, Type, type = "level")
#> int Location^1
#> -8.009091 2.969832
# A slope of around 3 is computed from the s-l plot; the suggested
# power is therefore 1 - 3 = -2. We can apply a power transformation within the
# function via the p argument. By default, a Box-Cox transformation method
# is adopted.
eda_sl(cars, MPG.city, Type, p = -2)
# Spread-location plot can also be generated from residuals of a linear model
M1 <- lm(mpg ~ hp, mtcars)
eda_sl(M1)
# Spread can be compared to X instead of fitted value
eda_sl(M1, type = "dependence")
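The power = 1 - slope rule used in the spread-level example can also be worked through by hand. The sketch below is an approximation under stated assumptions: it uses the built-in mtcars data rather than Cars93, an ordinary least-squares fit via lm() rather than a robust fit, and a hypothetical helper fourth_spread built on fivenum(). It is not the package's internal code, so its numbers are not expected to match eda_sl output.

```r
y <- mtcars$mpg
g <- factor(mtcars$cyl)

# Hypothetical helper: fourth spread = upper hinge - lower hinge.
# The spread of the residuals (y minus group median) equals the spread
# of y itself within a group, because subtracting a constant does not
# change the spread.
fourth_spread <- function(v) {
  h <- fivenum(v)
  h[4] - h[2]
}

lev       <- log(tapply(y, g, median))         # log of group medians
sprd_frth <- log(tapply(y, g, fourth_spread))  # log of fourth spreads
sprd_iqr  <- log(tapply(y, g, IQR))            # log of IQRs, for comparison

# Fit a straight line to spread versus level (plain least squares here)
fit   <- lm(sprd_frth ~ lev)
slope <- unname(coef(fit)[2])

# Suggested power transformation
power <- 1 - slope
c(slope = slope, power = power)

# Hinges and quartiles can differ slightly in small groups
cbind(fourth = sprd_frth, IQR = sprd_iqr)
```

With only three groups the fitted slope is a rough guide at best; the side-by-side columns show how the fourth spread and the IQR can differ when groups are small.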