Univariate EDA preamble

What You Will Learn in This Section

The chapters in this section walk you through the univariate foundations of Exploratory Data Analysis (EDA). The goal is not to confirm hypotheses but to build a conceptual and visual toolkit for understanding the structure of a single variable: its shape, spread, and location, and how it behaves under modeling and transformation.

Chapter 17: Visualizing Univariate Distributions

This chapter introduces the first line of attack in understanding a batch of data: visual displays. These include histograms, boxplots, and density plots, all designed to reveal the distributional shape of a variable. The chapter also introduces quantiles, which are essential for comparing distributions and form the basis for the QQ plots in the next two chapters.
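As a minimal sketch of how quantiles summarize a batch, the snippet below computes the quartiles of a small made-up batch with Python's standard `statistics` module. The data and variable names are illustrative assumptions, and the "inclusive" interpolation rule is only one of several conventions; the chapter itself may use a different one.

```python
import statistics

# Hypothetical batch of measurements (not taken from the text).
batch = [2, 4, 4, 5, 7, 9, 10, 12, 15, 21]

# Three cut points that split the sorted batch into four equal-count parts.
q1, median, q3 = statistics.quantiles(batch, n=4, method="inclusive")

# The interquartile range is a quantile-based measure of spread.
iqr = q3 - q1

print(f"Q1={q1}, median={median}, Q3={q3}, IQR={iqr}")
```

The median describes location and the interquartile range describes spread, so a handful of quantiles already captures much of what a boxplot displays.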

Chapter 18: Compare batches - the empirical QQ plot

Here, we move from describing a single batch to comparing multiple batches. The chapter introduces empirical quantile–quantile (QQ) plots as a powerful alternative to side-by-side boxplots. QQ plots allow for direct visual comparison of shape, spread, and location between two distributions.
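The idea behind an empirical QQ plot can be sketched in a few lines. The example below assumes two batches of equal size, in which case the matching quantiles are simply the sorted values paired in order; the function name `qq_pairs` and the data are hypothetical, and batches of unequal size would need interpolated quantiles, which this sketch does not handle.

```python
def qq_pairs(batch_a, batch_b):
    """Pair matching quantiles of two equal-sized batches.

    With equal sizes, the i-th smallest value of one batch is matched
    with the i-th smallest value of the other.
    """
    if len(batch_a) != len(batch_b):
        raise ValueError("this sketch assumes batches of equal size")
    return list(zip(sorted(batch_a), sorted(batch_b)))


# Hypothetical batches: the sorted values of batch_b sit 3 units above batch_a.
batch_a = [5, 1, 8, 3, 9, 2]
batch_b = [11, 4, 12, 6, 5, 8]

pairs = qq_pairs(batch_a, batch_b)   # points to draw on the QQ plot
offsets = [b - a for a, b in pairs]  # vertical distance from the y = x line

print(pairs)
print(offsets)
```

When every offset is the same, the points fall on a line parallel to y = x, which indicates that the two batches differ by an additive shift in location while sharing the same shape and spread.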

Chapter 19: Theoretical QQ Plots

This chapter extends the QQ plot framework by comparing a batch of data to a theoretical distribution (e.g., Normal, exponential). This helps assess whether a dataset conforms to a known reference shape and introduces the idea of using theoretical models as baselines for comparison.
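A theoretical QQ plot pairs each sorted observation with the quantile of a reference distribution at the same cumulative fraction. The sketch below uses the standard Normal as the reference and assumes the f-value convention (i - 0.5) / n for the i-th ordered value; both the convention and the data are assumptions for illustration, and the chapter may adopt a different formula.

```python
from statistics import NormalDist


def normal_qq_pairs(batch):
    """Return (theoretical Normal quantile, observed value) pairs."""
    x = sorted(batch)
    n = len(x)
    std_normal = NormalDist(mu=0.0, sigma=1.0)
    # Cumulative fraction assigned to the i-th ordered value (1-based).
    f_values = [(i - 0.5) / n for i in range(1, n + 1)]
    theoretical = [std_normal.inv_cdf(f) for f in f_values]
    return list(zip(theoretical, x))


# Hypothetical batch of measurements.
batch = [4.1, 5.3, 4.8, 6.0, 5.1, 4.4, 5.6]

pairs = normal_qq_pairs(batch)
for theoretical_q, observed in pairs:
    print(f"{theoretical_q:6.2f}  {observed}")
```

If the plotted pairs fall close to a straight line, the batch is roughly Normal in shape; the line's slope and intercept then reflect the batch's spread and location.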

Chapter 20: Fitting and Residuals

We now introduce the idea of fitting a model to univariate data (typically a group-specific mean) and analyzing the residuals, which represent the variation not explained by the model. This chapter emphasizes the importance of residual homogeneity: residuals should have similar spread across groups to ensure fair comparisons and valid inferences.
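To make the fit-plus-residual split concrete, here is a minimal sketch that fits a mean to each group and subtracts it from the observations. The group labels and values are invented for illustration, and the range is used only as a crude stand-in for whatever spread measure the chapter prefers.

```python
from statistics import mean

# Hypothetical batches, one per group.
groups = {
    "A": [10, 12, 14],
    "B": [20, 25, 30],
    "C": [5, 6, 10],
}

# The "model": each group is summarized by its own mean.
fitted = {name: mean(values) for name, values in groups.items()}

# Residual = observed value minus the fitted value for its group.
residuals = {
    name: [v - fitted[name] for v in values]
    for name, values in groups.items()
}

# A crude check of residual homogeneity: compare residual ranges by group.
residual_range = {name: max(r) - min(r) for name, r in residuals.items()}

print(fitted)
print(residuals)
print(residual_range)
```

In this made-up example group B's residuals span a wider range than those of A or C, which is the kind of unequal spread the chapter warns about.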

Chapter 21: Visualizing Explained and Unexplained Variation

This chapter builds on the concept of residuals by introducing two visual tools:

The variability decomposition (vd) plot, which compares the spread of fitted values to the spread of residuals.
The residual–fit spread (rfs) plot, which aligns quantiles of fitted values and residuals to assess model fit more precisely.

These tools help quantify how much of the total variation is explained by the model versus what remains unexplained.
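The comparison behind a residual-fit spread plot can be sketched as follows, assuming a group-mean fit: the fitted values are centered on their overall mean, and their sorted values are set against the sorted residuals. The data are hypothetical, and summarizing each side by its range is a simplification made here for brevity; the plot itself shows the full set of quantiles.

```python
from statistics import mean

# Hypothetical batches, one per group.
groups = {
    "A": [10, 12, 14],
    "B": [20, 25, 30],
    "C": [5, 6, 10],
}

fitted, residuals = [], []
for values in groups.values():
    group_mean = mean(values)
    for v in values:
        fitted.append(group_mean)        # one fitted value per observation
        residuals.append(v - group_mean)

# Center the fitted values so both sets vary around zero.
grand_mean = mean(fitted)
fit_minus_mean = sorted(f - grand_mean for f in fitted)
sorted_residuals = sorted(residuals)

# Crude spread comparison: range of each sorted sequence.
fit_range = fit_minus_mean[-1] - fit_minus_mean[0]
residual_range = sorted_residuals[-1] - sorted_residuals[0]

print(f"spread of fitted values: {fit_range:.1f}")
print(f"spread of residuals:     {residual_range:.1f}")
```

When the fitted values spread more widely than the residuals, as in this invented data, the group means account for a large share of the total variation.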

Chapter 22: Transforming Data: Re-expression for Shape and Spread

Real-world data often violate assumptions of symmetry, Normality, or equal spread. This chapter introduces re-expression (transformation) as a strategy to:

Symmetrize skewed distributions
Stabilize spread across groups
Normalize data for modeling

You’ll learn about log transformations, Tukey’s ladder of powers, and Box–Cox transformations, and how to choose a transformation based on the shape and behavior of your data.
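As a rough illustration of re-expression, the sketch below implements the Box–Cox form of a power transformation, in which a power of 0 is defined as the natural log. The function name `box_cox` and the data are assumptions for this example, and the sketch handles strictly positive values only.

```python
import math


def box_cox(x, p):
    """Box-Cox re-expression of a positive value x with power p.

    p = 1 leaves the shape unchanged (up to a shift); p = 0 is the log.
    """
    if x <= 0:
        raise ValueError("this sketch handles strictly positive values only")
    if p == 0:
        return math.log(x)
    return (x ** p - 1) / p


# Hypothetical right-skewed batch: each value doubles the previous one.
skewed = [1, 2, 4, 8, 16, 32, 64]

logged = [box_cox(v, 0) for v in skewed]

# Gaps between successive values before and after re-expression.
raw_gaps = [b - a for a, b in zip(skewed, skewed[1:])]
log_gaps = [b - a for a, b in zip(logged, logged[1:])]

print(raw_gaps)
print([round(g, 3) for g in log_gaps])
```

The raw gaps grow toward the upper tail, while the gaps after the log are equal, so the re-expressed batch is symmetric about its median. Moving down the ladder to smaller powers pulls in a long right tail progressively harder.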

Chapter 23: Diagnosing Unequal Spread

This chapter focuses on detecting and correcting heteroskedasticity, that is, situations where the spread of residuals changes with the level of the fitted value. You’ll learn to use:

Spread–location plots, which visualize how residual spread varies with group medians
Spread–level plots, which use log–log relationships to suggest power transformations

These tools help determine whether a transformation is needed and what kind might be appropriate.
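The log–log reasoning behind a spread–level plot can be sketched as follows. The example regresses the log of each group's interquartile range on the log of its median and applies the commonly used rule of thumb that the suggested power is one minus the slope. The use of the IQR as the spread measure, the rule of thumb, the helper names, and the data are all assumptions for this sketch; the chapter may define these details differently.

```python
import math
import statistics


def iqr(values):
    """Interquartile range using the 'inclusive' quantile convention."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q3 - q1


def spread_level_slope(groups):
    """Least-squares slope of log(spread) against log(level)."""
    xs = [math.log(statistics.median(g)) for g in groups]
    ys = [math.log(iqr(g)) for g in groups]
    x_bar, y_bar = statistics.mean(xs), statistics.mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    return sxy / sxx


# Hypothetical groups whose spread grows in proportion to their level.
base = [1, 2, 3, 4, 5]
groups = [[k * v for v in base] for k in (1, 10, 100)]

slope = spread_level_slope(groups)
suggested_power = 1 - slope  # rule of thumb: power = 1 - slope

print(f"slope = {slope:.3f}, suggested power = {suggested_power:.3f}")
```

Here the spread is proportional to the level, so the slope is close to 1 and the suggested power is close to 0, which on the ladder of powers corresponds to a log transformation.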

Chapter 24: Letter value summaries

This final chapter extends the five-number summary of the boxplot by exploring Tukey’s letter-value summaries. These summaries provide a more detailed view of a distribution’s symmetry and tail behavior. They are especially useful when comparing batches of data or when subtle asymmetries matter.
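A letter-value summary can be sketched by repeatedly halving the depth, starting from the median. The example below follows the usual depth rule (the next depth is half of one plus the integer part of the current depth) and reports the midsummary, the average of each lower and upper letter value. The function name, the three-letter limit, and the data are illustrative assumptions.

```python
import math


def letter_values(batch, n_letters=3):
    """Return (letter, depth, lower, upper, mid) for M, F, E, ..."""
    x = sorted(batch)
    n = len(x)
    depth = (n + 1) / 2  # depth of the median
    summary = []
    for letter in "MFEDCBA"[:n_letters]:
        lo_i, hi_i = math.floor(depth), math.ceil(depth)  # 1-based positions
        # Average two neighbors when the depth falls between positions.
        lower = (x[lo_i - 1] + x[hi_i - 1]) / 2
        upper = (x[n - lo_i] + x[n - hi_i]) / 2
        summary.append((letter, depth, lower, upper, (lower + upper) / 2))
        depth = (math.floor(depth) + 1) / 2  # depth of the next letter value
    return summary


# Hypothetical right-skewed batch of nine values.
batch = [1, 2, 3, 4, 5, 7, 10, 15, 30]

summary = letter_values(batch)
for letter, depth, lower, upper, mid in summary:
    print(f"{letter}: depth={depth}, lower={lower}, upper={upper}, mid={mid}")
```

For a symmetric batch the midsummaries stay roughly constant; in this made-up batch they climb from the median outward, which signals a longer upper tail.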

The Big Picture

Together, these chapters form a coherent arc:

Visualize the distribution of a variable.
Compare batches using quantiles.
Model the data using simple fits.
Diagnose what the model misses using residuals.
Quantify explained vs. unexplained variation.
Transform the data when assumptions are violated.
Refine the model by checking for unequal spread.
Summarize distributions with greater depth using letter-value summaries.

This sequence prepares you to move from raw data to interpretable models, using visualization as both a diagnostic and a discovery tool.