Exploratory Data Analysis in R

This repo houses lecture notes for an Exploratory Data Analysis in R course taught to undergraduates at Colby College. The course assumes little to no background in quantitative analysis or computer programming and was first taught in spring 2015. It introduces students to data manipulation in R, data exploration (in the spirit of John Tukey’s EDA) and the R Markdown language. Many of the visualization tools are adopted from William Cleveland’s book Visualizing Data.

Week 1

The R and RStudio environments

  • Command line vs. script file
  • Packages
    • Base packages
    • Installing packages from CRAN
    • Installing packages from GitHub
    • Using a package in an R session
  • Getting a session’s info
  • A brief video intro to the RStudio environment
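
As a rough illustration of the package workflow listed above, the sketch below shows the usual calls. The install lines are commented out because they need network access, and the package names in them are only examples, not a list of what the course requires.

```r
# Installing is done once per machine (commented out here; needs network access)
# install.packages("dplyr")                      # from CRAN
# devtools::install_github("tidyverse/dplyr")    # from GitHub; assumes devtools is installed

# Loading is done once per session. 'tools' ships with R but is not attached by default.
library(tools)
file_ext("survey_data.csv")   # a function that is now available without the tools:: prefix

# Getting the session's info: R version, attached packages, platform
info <- sessionInfo()
info$R.version$version.string
info$basePkgs
```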

Week 2

Data object type and structure

  • Data types (aka mode) in R: numeric, character, factor, logical, dates
  • Data objects in R: vectors, data frames, lists
  • Coercing data from one type to another
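
A minimal sketch of the ideas in this unit, using made-up values (none of the object names below come from the lecture notes):

```r
# Illustrative vectors, one per data type
x_num <- c(1.5, 2, 3.25)                   # numeric
x_chr <- c("10", "20", "thirty")           # character
x_lgl <- c(TRUE, FALSE, TRUE)              # logical
x_fct <- factor(c("low", "high", "low"))   # factor

mode(x_num)     # storage mode of the vector
class(x_fct)    # a factor is stored as integer codes plus labels

# Coercion: text that cannot be read as a number becomes NA (R also warns)
as_num <- suppressWarnings(as.numeric(x_chr))

# Logical values coerce to 1 (TRUE) and 0 (FALSE)
lgl_num <- as.numeric(x_lgl)

# Convert a factor to character to recover its labels rather than its codes
fct_chr <- as.character(x_fct)

# A data frame holds equal-length vectors; a list can hold objects of any kind
df  <- data.frame(id = 1:3, value = x_num, flag = x_lgl)
lst <- list(numbers = x_num, labels = x_chr, table = df)
str(lst)
```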

How to read and create data files in R

  • Reading/writing files: comma delimited (CSV), tab delimited, R files
  • Data can be loaded from a file residing on your local drive or on the web
  • Importing files from other formats such as Excel, Stata and SAS

Working with Date objects

  • Creating date/time objects using lubridate
  • Extracting date information
  • Concatenating vectors with paste
  • Operating on dates
  • Formatting date objects
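
The steps above might look like the following sketch. It assumes the lubridate package is installed; the date components are invented for illustration.

```r
library(lubridate)  # assumed to be installed

# Concatenate separate (made-up) components into one string with paste
yr <- 2015; mo <- 2; dy <- 14
date_string <- paste(yr, mo, dy, sep = "-")

# Create a date object from a year-month-day string
d <- ymd(date_string)

# Extract date information
year(d)
month(d)
wday(d, label = TRUE)

# Operate on dates
d_later  <- d + days(30)
gap_days <- as.numeric(d_later - d)   # difference expressed in days

# Format a date object for display (month names depend on the locale)
format(d, "%d %B %Y")
```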

Exploring and cleaning dataframes using base functions

  • Exploring the table with dim, names, head, str and summary
  • Deleting columns
  • Replacing values with NA [video]
  • The matching operator %in% [video]
  • The Boolean operators &, | and !
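
A small base R sketch of these operations. The table, its column names and the -999 missing-value code are all assumptions made for the example.

```r
# A made-up table
dat <- data.frame(
  site  = c("A", "B", "C", "D"),
  temp  = c(12.1, -999, 15.3, 14.8),
  group = c("x", "y", "x", "z"),
  notes = c("", "", "check", ""),
  stringsAsFactors = FALSE
)

# Explore the table
dim(dat)
names(dat)
head(dat, 2)
str(dat)
summary(dat)

# Delete a column
dat$notes <- NULL

# Replace a placeholder value with NA
dat$temp[dat$temp == -999] <- NA

# Matching operator: which rows have a group found in the lookup set?
in_xy <- dat$group %in% c("x", "y")

# Boolean operators: combine conditions with & (and), | (or) and ! (not)
keep <- in_xy & !is.na(dat$temp)
dat[keep, ]
```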

Week 3

Manipulating dataframes with dplyr

  • Subsetting by row values: filter
  • Subsetting columns: select
  • Computing and/or adding columns: mutate
  • Summarizing by category: group_by and summarise
  • Sorting by column values: arrange
  • Combining operations using the pipe %>% operator
  • Conditional statements ifelse, if_else, case_when and recode
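
To give a flavor of how these verbs chain together, here is a hedged sketch on invented data (the columns region, year and yield are hypothetical). It assumes dplyr is installed.

```r
library(dplyr)  # assumed to be installed

# Made-up data
dat <- data.frame(
  region = c("north", "north", "south", "south", "south"),
  year   = c(2014, 2015, 2014, 2015, 2015),
  yield  = c(3.2, 3.8, 2.1, 2.9, 3.1),
  stringsAsFactors = FALSE
)

result <- dat %>%
  filter(year == 2015) %>%                               # subset rows
  select(region, yield) %>%                              # subset columns
  mutate(level = ifelse(yield > 3, "high", "low")) %>%   # add a column
  group_by(region) %>%                                   # define categories
  summarise(mean_yield = mean(yield),                    # summarize each category
            n_high     = sum(level == "high")) %>%
  arrange(desc(mean_yield))                              # sort by a column

result
```

Each verb takes a table and returns a table, which is what lets the pipe pass one step's output to the next.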

Tidying/reshaping tables using tidyr

  • From long to wide: spread
  • From wide to long: gather
  • Combining columns: unite
  • Splitting columns: separate
  • Adding missing records: complete
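
A short sketch of reshaping with the functions named above, on a made-up table (the column names are illustrative). It assumes tidyr is installed.

```r
library(tidyr)  # assumed to be installed

# A made-up wide table: one column per year
wide <- data.frame(
  site  = c("A", "B"),
  y2014 = c(10, 20),
  y2015 = c(11, 21),
  stringsAsFactors = FALSE
)

# From wide to long: stack the year columns into key/value pairs
long <- gather(wide, key = "year", value = "count", y2014, y2015)

# From long to wide: spread the key back out into columns
wide_again <- spread(long, key = year, value = count)

# Combine two columns into one, then split them apart again
united <- unite(long, "id", site, year, sep = "_")
split  <- separate(united, id, into = c("site", "year"), sep = "_")
```

In tidyr 1.0.0 and later, pivot_longer and pivot_wider supersede gather and spread, though both older functions remain available.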

Example of a few data manipulation workflows

  • These are data objects created for use in subsequent tutorials

Week 4

Joining data tables

  • Left join
  • Right join
  • Inner join
  • Full join
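
The four joins differ only in which unmatched rows they keep. A minimal sketch with two invented tables, assuming dplyr is installed:

```r
library(dplyr)  # assumed to be installed

# Two made-up tables sharing an 'id' key
students <- data.frame(id = c(1, 2, 3), name  = c("Ana", "Ben", "Cy"),
                       stringsAsFactors = FALSE)
grades   <- data.frame(id = c(2, 3, 4), grade = c("B", "A", "C"),
                       stringsAsFactors = FALSE)

left  <- left_join(students, grades, by = "id")    # keeps every row of students
right <- right_join(students, grades, by = "id")   # keeps every row of grades
inner <- inner_join(students, grades, by = "id")   # keeps only ids found in both
full  <- full_join(students, grades, by = "id")    # keeps ids found in either
```

Unmatched cells are filled with NA, so student 1 has no grade in the left join and id 4 has no name in the right join.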

Working with string objects

  • Finding patterns in a string
  • Modifying strings
  • Extracting parts of a string

Week 5

The base plotting environment

  • Point and line plots
  • Boxplots
  • Histograms
  • Density plot
  • Point and line symbols
  • Exporting to files (tiff and PDF)

The lattice plotting environment (optional)

  • Conditioning on factors
  • Displaying univariate data
  • Displaying multivariate data
  • Customizing trellis plots

The ggplot plotting environment

  • Line geometry geom_line
  • Point geometry geom_point
  • Boxplot geometry geom_boxplot
  • Histogram geometry geom_histogram
  • Violin plot geometry geom_violin
  • Combining geometries and layers
  • Faceting plots (trellis plots)

Manipulating colors in R

  • The rgb() function
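
As a brief illustration, rgb() turns red, green and blue intensities into a hexadecimal color string that plotting functions accept. The particular colors below are arbitrary.

```r
# Intensities on a 0-1 scale (the default)
red <- rgb(1, 0, 0)                               # "#FF0000"

# The same color with intensities on a 0-255 scale
red_255 <- rgb(255, 0, 0, maxColorValue = 255)    # "#FF0000"

# A fourth value sets transparency; it adds two more hex digits to the string
blue_half <- rgb(0, 0, 1, alpha = 0.5)

c(red, red_255, blue_half)
```

The resulting strings can be passed to the col argument of base plotting functions or to color and fill settings in ggplot2.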

Week 6

The R Markdown language

Visualizing univariate data

  • Boxplots
  • Histograms
  • Quantile plots

Comparing univariate data distributions

  • Side-by-side boxplots
  • Quantile-quantile plots (q-q)
  • Tukey mean-difference plots (m-d)
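
A rough sketch of how an empirical q-q plot and its Tukey mean-difference version can be built in base R. It assumes two simulated batches of equal size, so matching quantiles are simply the sorted values; the batch names a and b are hypothetical.

```r
set.seed(1)
a <- rnorm(50, mean = 10, sd = 2)   # simulated batch 1
b <- rnorm(50, mean = 12, sd = 2)   # simulated batch 2

# Empirical q-q: pair up the sorted values of the two batches
qq <- qqplot(a, b, plot.it = FALSE)   # returns a list with sorted x and y

plot(qq$x, qq$y, xlab = "Batch a", ylab = "Batch b")
abline(0, 1)   # points on this line would mean identical distributions

# Tukey m-d: plot the difference of paired quantiles against their mean
md_mean <- (qq$x + qq$y) / 2
md_diff <- qq$y - qq$x

plot(md_mean, md_diff, xlab = "Mean", ylab = "Difference")
abline(h = 0)
```

The m-d plot is the q-q plot rotated so that the reference line is horizontal, which makes offsets between the batches easier to read.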

Week 7

The theoretical q-q plot

  • Imposing a structure: the normal distribution

Week 8

Fits and residuals

  • Fitting univariate data
  • Extracting the residuals
  • Residual-fit spread plot

Spread-location plot

  • Detecting changes in the spread
  • Interpreting an s-l plot

Week 9

Re-expressing data

  • Log transformation
  • Box-Cox family of transformations
  • How quantile plots behave in the face of skewed data
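
One common form of the Box-Cox family can be written as a short function. This is a sketch under the assumption that the usual (x^p - 1)/p definition is intended; the lecture notes may use a different parameterization, and the function name box_cox is made up.

```r
# Box-Cox re-expression for positive values:
#   (x^p - 1) / p   when p is not 0
#   log(x)          when p is 0 (the limit of the expression above as p approaches 0)
box_cox <- function(x, p) {
  if (p == 0) log(x) else (x^p - 1) / p
}

x <- c(1, 10, 100, 1000)   # made-up, strongly right-skewed values

box_cox(x, 1)      # p = 1 only shifts the data: x - 1
box_cox(x, 0.5)    # square-root-like re-expression
box_cox(x, 0)      # log transformation
box_cox(x, -1)     # reciprocal-like re-expression
```

Lowering p pulls in the upper tail more strongly, which is why powers below 1 are tried on right-skewed data.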

Letter value summaries

  • Constructing letter value summaries
  • Interpreting letter value summaries

The Two R’s of EDA

  • Robustness
  • Re-expression

Week 10

Bivariate analysis

  • Scatter plots
  • Fitting the data
    • Parametric fit
    • LOESS fit
  • Residuals
    • Residual dependence plot
    • Spread-location plot

Resistant lines

  • Robust lines
    • Tukey’s 3-point summary
    • Bisquare estimation method
  • Robust loess
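
To illustrate why resistant fits matter, the sketch below compares an ordinary least-squares line with three robust alternatives on simulated data containing one outlier. The data are invented, and the function choices (line from stats, rlm from MASS, loess with family = "symmetric") are one plausible way to carry out these fits, not necessarily the approach taken in the lecture notes.

```r
set.seed(2)
x <- 1:20
y <- 2 + 0.5 * x + rnorm(20, sd = 0.3)   # true slope is 0.5
y[20] <- 40                              # plant a single outlier

# Ordinary least squares, for comparison
fit_ols <- lm(y ~ x)

# Tukey's resistant line, based on medians of the left and right thirds
fit_res <- line(x, y)

# Bisquare estimation via iteratively reweighted least squares (MASS ships with R)
fit_bi <- MASS::rlm(y ~ x, psi = MASS::psi.bisquare)

# Robust loess: family = "symmetric" downweights points with large residuals
fit_lo <- loess(y ~ x, family = "symmetric")

coef(fit_ols)
coef(fit_res)
coef(fit_bi)
```

Comparing the slopes shows how far a single point can pull the least-squares fit while the resistant fits stay close to the bulk of the data.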

Week 11

The third R of EDA: Residuals

  • Exploring atmospheric CO2 data

Detecting discontinuities in the data

  • Slicing data
  • Changepoint detection

Week 12

Two-way tables

  • Median polish
  • Mean polish and the two-way ANOVA analysis
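
A minimal median polish sketch using medpolish() from the stats package. The table values and dimension names are made up for illustration.

```r
# A made-up 3 x 3 two-way table
tab <- matrix(c(25, 30, 28,
                32, 36, 35,
                40, 47, 41),
              nrow = 3, byrow = TRUE,
              dimnames = list(row = c("r1", "r2", "r3"),
                              col = c("c1", "c2", "c3")))

# Iteratively sweep out row and column medians
mp <- medpolish(tab, trace.iter = FALSE)

mp$overall     # common value
mp$row         # row effects
mp$col         # column effects
mp$residuals   # what the additive model leaves unexplained

# The pieces add back up to the original table
rebuilt <- mp$overall + outer(mp$row, mp$col, "+") + mp$residuals
```

The decomposition is additive: each cell equals the common value plus its row effect, its column effect and a residual.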

Week 13

Creating maps in R

  • Using maps package datasets
  • Loading custom shapefiles
  • Continuous vs. discrete color schemes

Relational databases

  • Querying with SQL
  • Querying with dplyr

Software version

This site was built with R 3.4.0 and the following packages:

##  package      * version  date       source        
##  assertthat     0.2.0    2017-04-11 CRAN (R 3.4.0)
##  BH             1.62.0-1 2016-11-19 CRAN (R 3.4.0)
##  bindr          0.1      2016-11-13 CRAN (R 3.4.0)
##  bindrcpp       0.2      2017-06-17 CRAN (R 3.4.1)
##  cellranger     1.1.0    2016-07-27 CRAN (R 3.4.0)
##  colorspace     1.3-2    2016-12-14 CRAN (R 3.4.0)
##  dichromat      2.0-0    2013-01-24 CRAN (R 3.4.0)
##  digest         0.6.12   2017-01-27 CRAN (R 3.4.0)
##  dplyr          0.7.1    2017-06-22 CRAN (R 3.4.1)
##  ggplot2        2.2.1    2016-12-30 CRAN (R 3.4.0)
##  glue           1.1.1    2017-06-21 CRAN (R 3.4.1)
##  graphics     * 3.4.0    2017-04-21 local         
##  grDevices    * 3.4.0    2017-04-21 local         
##  grid           3.4.0    2017-04-21 local         
##  gtable         0.2.0    2016-02-26 CRAN (R 3.4.0)
##  labeling       0.3      2014-08-23 CRAN (R 3.4.0)
##  lattice        0.20-35  2017-03-25 CRAN (R 3.4.0)
##  lazyeval       0.2.0    2016-06-12 CRAN (R 3.4.0)
##  lubridate      1.6.0    2016-09-13 CRAN (R 3.4.0)
##  magrittr       1.5      2014-11-22 CRAN (R 3.4.0)
##  MASS           7.3-47   2017-02-26 CRAN (R 3.4.0)
##  methods      * 3.4.0    2017-04-21 local         
##  munsell        0.4.3    2016-02-13 CRAN (R 3.4.0)
##  pkgconfig      2.0.1    2017-03-21 CRAN (R 3.4.0)
##  plogr          0.1-1    2016-09-24 CRAN (R 3.4.0)
##  plyr           1.8.4    2016-06-08 CRAN (R 3.4.0)
##  R6             2.2.2    2017-06-17 CRAN (R 3.4.1)
##  RColorBrewer   1.1-2    2014-12-07 CRAN (R 3.4.0)
##  Rcpp           0.12.12  2017-07-15 CRAN (R 3.4.1)
##  readxl         1.0.0    2017-04-18 CRAN (R 3.4.0)
##  rematch        1.0.1    2016-04-21 CRAN (R 3.4.0)
##  reshape2       1.4.2    2016-10-22 CRAN (R 3.4.0)
##  rlang          0.1.1    2017-05-18 CRAN (R 3.4.0)
##  scales         0.4.1    2016-11-09 CRAN (R 3.4.0)
##  stats        * 3.4.0    2017-04-21 local         
##  stringi        1.1.5    2017-04-07 CRAN (R 3.4.0)
##  stringr        1.2.0    2017-02-18 CRAN (R 3.4.0)
##  tibble         1.3.3    2017-05-28 CRAN (R 3.4.0)
##  tidyr          0.6.3    2017-05-15 CRAN (R 3.4.0)
##  tools          3.4.0    2017-04-21 local         
##  utils        * 3.4.0    2017-04-21 local