The dataset used in this tutorial was downloaded from NOAA’s National Data Buoy Center (NDBC) in a space-delimited file format. It consists of hourly air and ocean physical measurements for 2012, such as temperature, wind speed and wave height, recorded at NOAA’s buoy #44005 located in the Gulf of Maine (43.204 N, 69.128 W).

Type the following to load the data file into your current R session (note that we are using the read.table function instead of the read.csv function since the data are not comma-delimited):

dat <- read.table("http://mgimond.github.io/ES218/Data/buoy_44005_2012.dat", header=TRUE)

Since read.table defaults to a whitespace-separated format (the same format our data file is stored in), there is no need to specify the sep= parameter.
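If you want to convince yourself of that default, it is easy to test on a small throw-away file (a minimal sketch; the temporary file and toy values are invented for illustration):

```r
# Write a tiny whitespace-delimited file to a temporary location
tmp <- tempfile(fileext = ".dat")
writeLines(c("a b c", "1 2 3"), tmp)

# sep = "" (the default) splits on any amount of whitespace,
# so no sep= argument is needed:
read.table(tmp, header = TRUE)
```

For a comma-delimited file you would instead call read.csv(), which is simply read.table() with sep = "," and header = TRUE pre-set.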

# Exploring a dataframe

First, let’s extract the number of rows and columns in our table:

dim(dat)
[1] 8347   18

dat consists of 8347 rows and 18 columns.
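If you only need one of the two dimensions, the nrow and ncol functions return them individually. A quick sketch on a throw-away data frame (so the snippet runs on its own; nrow(dat) and ncol(dat) behave the same way):

```r
df <- data.frame(x = 1:4, y = letters[1:4])  # throw-away example data frame
nrow(df)  # number of rows: 4
ncol(df)  # number of columns: 2
dim(df)   # both at once: 4 2
```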

We can extract the names of each column using the names function:

names(dat)
 [1] "YY"   "MM"   "DD"   "hh"   "mm"   "WDIR" "WSPD" "GST"  "WVHT" "DPD"
[11] "APD"  "MWD"  "PRES" "ATMP" "WTMP" "DEWP" "VIS"  "TIDE"

A brief description of each column follows:

| Field | Description |
|-------|-------------|
| YY   | Year |
| MM   | Month |
| DD   | Day |
| hh   | Hour |
| mm   | Minute |
| WDIR | Wind direction (the direction the wind is coming from, in degrees clockwise from true N) |
| WSPD | Wind speed, averaged over an eight-minute period (m/s) |
| GST  | Peak 5 or 8 second gust speed measured during the eight-minute or two-minute period (m/s) |
| WVHT | Significant wave height (meters) |
| DPD  | Dominant wave period (seconds) |
| APD  | Average wave period (seconds) |
| MWD  | The direction from which the waves at the dominant period (DPD) are coming, in degrees from true North, increasing clockwise, with North as 0 (zero) degrees and East as 90 degrees |
| PRES | Sea level pressure (hPa) |
| ATMP | Air temperature (Celsius) |
| WTMP | Sea surface temperature (Celsius) |
| DEWP | Dewpoint temperature (Celsius) |
| VIS  | Visibility (nautical miles) |
| TIDE | Water level above or below Mean Lower Low Water, MLLW (feet) |

For large datasets, it’s sometimes helpful to display just the first few lines of the table to get a sense of the kind of data we are dealing with. We’ll use the head function to do this and have R return the first five records of the table (n=5).

head(dat, n = 5)
    YY MM DD hh mm WDIR WSPD  GST WVHT  DPD  APD MWD   PRES ATMP WTMP DEWP VIS
1 2011 12 31 23 50  316  4.8  6.0 0.92 5.56 4.88 999 1008.1  6.8  8.1  6.8  99
2 2012  1  1  0 50  296  5.5  6.3 0.89 5.56 4.60 999 1007.8  6.1  8.1  6.1  99
3 2012  1  1  1 50  311  8.4 10.0 0.87 8.33 4.09 999 1008.2  5.5  8.1  5.5  99
4 2012  1  1  2 50  314  8.3 10.0 0.93 3.45 4.05 999 1008.0  5.4  8.1  5.4  99
5 2012  1  1  3 50  313 11.1 13.0 1.12 3.85 3.88 999 1008.0  5.8  8.1  5.8  99
TIDE
1   99
2   99
3   99
4   99
5   99

We can also display the last few lines of the table using the tail function.

tail(dat, n = 5)
       YY MM DD hh mm WDIR WSPD GST WVHT   DPD  APD MWD   PRES ATMP WTMP DEWP
8343 2012 12 14  1 50  999   99  99 0.31  9.09 5.15 999 1026.3  5.6  9.7 -4.1
8344 2012 12 14  2 50  999   99  99 0.33  9.09 4.69 999 1025.5  5.6  9.7 -4.2
8345 2012 12 14  3 50  999   99  99 0.36 12.90 4.04 999 1025.0  5.6  9.7 -4.2
8346 2012 12 14  4 50  999   99  99 0.36 12.90 3.87 999 1024.6  5.6  9.7 -4.0
8347 2012 12 14  5 50  999   99  99 0.37  8.33 3.66 999 1023.8  5.6  9.6 -3.9
VIS TIDE
8343  99   99
8344  99   99
8345  99   99
8346  99   99
8347  99   99

To view the data type/class associated with each variable in the table, we’ll use the str function.

str(dat)
'data.frame':   8347 obs. of  18 variables:
 $ YY  : int  2011 2012 2012 2012 2012 2012 2012 2012 2012 2012 ...
 $ MM  : int  12 1 1 1 1 1 1 1 1 1 ...
 $ DD  : int  31 1 1 1 1 1 1 1 1 1 ...
 $ hh  : int  23 0 1 2 3 4 5 6 7 8 ...
 $ mm  : int  50 50 50 50 50 50 50 50 50 50 ...
 $ WDIR: int  316 296 311 314 313 319 319 318 332 322 ...
 $ WSPD: num  4.8 5.5 8.4 8.3 11.1 11.1 11.1 10.5 9.8 8.2 ...
 $ GST : num  6 6.3 10 10 13 13.3 13.2 12.5 11.5 9.6 ...
 $ WVHT: num  0.92 0.89 0.87 0.93 1.12 1.25 1.48 1.46 1.41 1.23 ...
 $ DPD : num  5.56 5.56 8.33 3.45 3.85 4.76 4.55 5 5 5 ...
 $ APD : num  4.88 4.6 4.09 4.05 3.88 4.06 4.31 4.39 4.31 4.15 ...
 $ MWD : int  999 999 999 999 999 999 999 999 999 999 ...
 $ PRES: num  1008 1008 1008 1008 1008 ...
 $ ATMP: num  6.8 6.1 5.5 5.4 5.8 5.8 6 6.5 6.8 6.7 ...
 $ WTMP: num  8.1 8.1 8.1 8.1 8.1 8 8 8.1 8.1 8.1 ...
 $ DEWP: num  6.8 6.1 5.5 5.4 5.8 5.7 5.8 5.5 5.1 5.2 ...
 $ VIS : num  99 99 99 99 99 99 99 99 99 99 ...
 $ TIDE: num  99 99 99 99 99 99 99 99 99 99 ...

All values are stored as numbers: about a third of the columns as integers (int) and the rest as double precision (num).
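If you prefer to tabulate the classes rather than read them off the str output, sapply can apply the class function to every column. A sketch on a throw-away data frame (on our data, table(sapply(dat, class)) gives the int/num split directly):

```r
df <- data.frame(i = 1:3, j = 10:12, x = c(1.5, 2.5, 3.5))  # throw-away example
sapply(df, class)          # class of each column, as a named vector
table(sapply(df, class))   # how many columns of each class
```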

Now let’s generate a descriptive summary of the table.

summary(dat)
       YY             MM               DD             hh              mm
Min.   :2011   Min.   : 1.000   Min.   : 1.0   Min.   : 0.00   Min.   :50
1st Qu.:2012   1st Qu.: 3.000   1st Qu.: 8.0   1st Qu.: 5.00   1st Qu.:50
Median :2012   Median : 6.000   Median :15.0   Median :11.00   Median :50
Mean   :2012   Mean   : 6.234   Mean   :15.4   Mean   :11.49   Mean   :50
3rd Qu.:2012   3rd Qu.: 9.000   3rd Qu.:23.0   3rd Qu.:17.50   3rd Qu.:50
Max.   :2012   Max.   :12.000   Max.   :31.0   Max.   :23.00   Max.   :50
WDIR            WSPD           GST             WVHT
Min.   :  1.0   Min.   : 0.0   Min.   : 0.00   Min.   : 0.260
1st Qu.:143.0   1st Qu.: 3.9   1st Qu.: 4.80   1st Qu.: 0.740
Median :213.0   Median : 6.2   Median : 7.40   Median : 1.070
Mean   :234.3   Mean   :10.5   Mean   :11.81   Mean   : 1.624
3rd Qu.:285.0   3rd Qu.: 8.9   3rd Qu.:10.70   3rd Qu.: 1.580
Max.   :999.0   Max.   :99.0   Max.   :99.00   Max.   :99.000
DPD              APD              MWD           PRES
Min.   : 2.600   Min.   : 3.120   Min.   :999   Min.   : 981.3
1st Qu.: 5.560   1st Qu.: 4.410   1st Qu.:999   1st Qu.:1009.2
Median : 7.140   Median : 5.000   Median :999   Median :1014.9
Mean   : 7.982   Mean   : 5.551   Mean   :999   Mean   :1017.9
3rd Qu.: 9.090   3rd Qu.: 5.830   3rd Qu.:999   3rd Qu.:1020.5
Max.   :99.000   Max.   :99.000   Max.   :999   Max.   :9999.0
ATMP             WTMP            DEWP              VIS          TIDE
Min.   :-11.00   Min.   : 5.40   Min.   :-16.000   Min.   :99   Min.   :99
1st Qu.:  5.60   1st Qu.: 7.10   1st Qu.:  0.900   1st Qu.:99   1st Qu.:99
Median : 10.40   Median :11.60   Median :  8.600   Median :99   Median :99
Mean   : 10.78   Mean   :12.17   Mean   :  8.268   Mean   :99   Mean   :99
3rd Qu.: 16.80   3rd Qu.:16.30   3rd Qu.: 15.300   3rd Qu.:99   3rd Qu.:99
Max.   : 25.10   Max.   :24.30   Max.   :999.000   Max.   :99   Max.   :99  

You’ll note that many columns in our data contain values of 99 or 999. Such values are often placeholders for missing data. If not dealt with properly, they may be interpreted as valid data in subsequent analyses. One tell-tale sign that these values should be treated specially is how extreme they are compared to the rest of the data batch. For example, values in the wave height field, WVHT, average less than 2 meters, so a value of 99 meters should be flagged as suspicious. In fact, NOAA’s documentation states that “Missing data … are denoted by … a variable number of 9’s … depending on the data type (for example: 999.0 99.0).”
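A quick way to gauge how widespread these placeholders are is to count them directly. A sketch, assuming dat is loaded as above (the counts will reflect whatever the downloaded file contains):

```r
sum(dat$WVHT == 99)    # how many 99 placeholders appear in the wave height column
mean(dat$WVHT == 99)   # ...and what fraction of the records they represent
```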

We also note that some columns contain nothing but missing values (e.g. MWD, VIS and TIDE), suggesting that either these variables are not measured at this particular buoy or the instruments did not function properly during the data collection period.
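We can let R find such degenerate columns for us by asking which columns hold a single repeated value throughout. A sketch, assuming dat is loaded; note that this check also flags mm, which is constant (always 50) yet perfectly valid, so the result still calls for human judgment:

```r
# Names of columns containing one unique value from start to finish
names(dat)[sapply(dat, function(x) length(unique(x)) == 1)]
# Expect mm (always 50) alongside the all-placeholder MWD, VIS and TIDE
```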

Before proceeding with any analysis, we need to address the missing values by flagging them as such or removing them altogether.

# Removing columns with no data

Let’s first remove all columns devoid of valid data. These columns are MWD, VIS and TIDE. There are many ways to do this. We will opt for subsetting the table and assigning this subset to a new data frame we will call dat1.

dat1 <- dat[ , !(names(dat) %in% c("MWD", "VIS", "TIDE"))]

Let’s break down this expression into its many components. names(dat) returns a list of column names in the order in which they appear in the object dat: YY, MM, DD, hh, mm, WDIR, WSPD, GST, WVHT, DPD, APD, MWD, PRES, ATMP, WTMP, DEWP, VIS, TIDE.

We use the value matching operator, %in%, to evaluate whether each column name is included in the set of elements defined in c("MWD", "VIS", "TIDE"). The expression could be read as “… is each of the column names (i.e. YY, MM, DD, hh, mm, WDIR, WSPD, GST, WVHT, DPD, APD, MWD, PRES, ATMP, WTMP, DEWP, VIS, TIDE) included in the set ("MWD", "VIS", "TIDE")?” The output of this comparison is a series of TRUE’s and FALSE’s depending on whether or not a match is found.

names(dat) %in% c("MWD", "VIS", "TIDE")
 [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE
[13] FALSE FALSE FALSE FALSE  TRUE  TRUE

The order in which the TRUE’s and FALSE’s are listed matches the order of the input column names. This order is important because this index is what R uses to identify which columns to return. But wait, R returns the columns associated with a TRUE value! We want the reverse, so we add the NOT operator, !, in front of the expression, thus flipping the TRUE/FALSE values.

!(names(dat) %in% c("MWD", "VIS", "TIDE"))
 [1]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE
[13]  TRUE  TRUE  TRUE  TRUE FALSE FALSE

So dat1 contains all the columns except MWD, VIS and TIDE.

head(dat1)
    YY MM DD hh mm WDIR WSPD  GST WVHT  DPD  APD   PRES ATMP WTMP DEWP
1 2011 12 31 23 50  316  4.8  6.0 0.92 5.56 4.88 1008.1  6.8  8.1  6.8
2 2012  1  1  0 50  296  5.5  6.3 0.89 5.56 4.60 1007.8  6.1  8.1  6.1
3 2012  1  1  1 50  311  8.4 10.0 0.87 8.33 4.09 1008.2  5.5  8.1  5.5
4 2012  1  1  2 50  314  8.3 10.0 0.93 3.45 4.05 1008.0  5.4  8.1  5.4
5 2012  1  1  3 50  313 11.1 13.0 1.12 3.85 3.88 1008.0  5.8  8.1  5.8
6 2012  1  1  4 50  319 11.1 13.3 1.25 4.76 4.06 1008.8  5.8  8.0  5.7
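Base R offers several equivalent ways to drop columns. Two common alternatives are sketched below; assuming dat is loaded, each produces the same 15-column data frame as the %in% approach:

```r
# subset() accepts unquoted column names after a minus sign
dat1 <- subset(dat, select = -c(MWD, VIS, TIDE))

# or keep the complement of the unwanted names
dat1 <- dat[ , setdiff(names(dat), c("MWD", "VIS", "TIDE"))]
```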

# Assigning NA to missing data

It’s usually best to assign NA values the moment a data file is loaded into R. For example, had all missing values been flagged as -9999, we could have added the parameter na.strings="-9999" to the read.table() function. Our dataset, however, flags missing values as 99, 999 and 9999, which would require combining the strings as na.strings = c("99", "999", "9999"). That would produce undesirable outcomes. A wave height of 99 meters clearly makes no sense, so flagging such a value as NA is sensible; but a wind direction of 99 degrees is perfectly plausible, and flagging it as NA would not be appropriate. Had we set all values of 99 to NA the moment we loaded the file into R, we would have inadvertently removed valid values from the dataset. Flagging missing values as NA in our dataset therefore requires column-level intervention.
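To see na.strings in action without risking our real data, here is a toy example (the file contents are invented): a -9999 placeholder is converted to NA at load time, while a legitimate wind direction of 99 degrees survives because "99" is not listed.

```r
tmp <- tempfile(fileext = ".dat")
writeLines(c("WSPD WDIR",
             "4.8   316",
             "-9999 99"), tmp)

demo <- read.table(tmp, header = TRUE, na.strings = "-9999")
demo  # WSPD becomes NA on the second row; WDIR keeps its valid 99
```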

Let’s review the data summary to identify each column’s no data numeric designation (i.e. 99 or 999 or 9999) if present. Remember that we are now working off of the dat1 dataframe and not the original dat dataframe.

summary(dat1)
       YY             MM               DD             hh              mm
Min.   :2011   Min.   : 1.000   Min.   : 1.0   Min.   : 0.00   Min.   :50
1st Qu.:2012   1st Qu.: 3.000   1st Qu.: 8.0   1st Qu.: 5.00   1st Qu.:50
Median :2012   Median : 6.000   Median :15.0   Median :11.00   Median :50
Mean   :2012   Mean   : 6.234   Mean   :15.4   Mean   :11.49   Mean   :50
3rd Qu.:2012   3rd Qu.: 9.000   3rd Qu.:23.0   3rd Qu.:17.50   3rd Qu.:50
Max.   :2012   Max.   :12.000   Max.   :31.0   Max.   :23.00   Max.   :50
WDIR            WSPD           GST             WVHT
Min.   :  1.0   Min.   : 0.0   Min.   : 0.00   Min.   : 0.260
1st Qu.:143.0   1st Qu.: 3.9   1st Qu.: 4.80   1st Qu.: 0.740
Median :213.0   Median : 6.2   Median : 7.40   Median : 1.070
Mean   :234.3   Mean   :10.5   Mean   :11.81   Mean   : 1.624
3rd Qu.:285.0   3rd Qu.: 8.9   3rd Qu.:10.70   3rd Qu.: 1.580
Max.   :999.0   Max.   :99.0   Max.   :99.00   Max.   :99.000
DPD              APD              PRES             ATMP
Min.   : 2.600   Min.   : 3.120   Min.   : 981.3   Min.   :-11.00
1st Qu.: 5.560   1st Qu.: 4.410   1st Qu.:1009.2   1st Qu.:  5.60
Median : 7.140   Median : 5.000   Median :1014.9   Median : 10.40
Mean   : 7.982   Mean   : 5.551   Mean   :1017.9   Mean   : 10.78
3rd Qu.: 9.090   3rd Qu.: 5.830   3rd Qu.:1020.5   3rd Qu.: 16.80
Max.   :99.000   Max.   :99.000   Max.   :9999.0   Max.   : 25.10
WTMP            DEWP
Min.   : 5.40   Min.   :-16.000
1st Qu.: 7.10   1st Qu.:  0.900
Median :11.60   Median :  8.600
Mean   :12.17   Mean   :  8.268
3rd Qu.:16.30   3rd Qu.: 15.300
Max.   :24.30   Max.   :999.000  

It appears that the fields with missing values flagged as 99 are WSPD, GST, WVHT, DPD and APD; fields with missing values flagged as 999 are WDIR and DEWP; and the field with missing values flagged as 9999 is PRES. We will convert these values to missing values in the following chunk of code.

# Create list of columns with missing values
na99   <- c("WSPD", "GST", "WVHT", "DPD",  "APD")
na999  <- c("WDIR",  "DEWP")
na9999 <- c("PRES")

# Replace missing values with NA
dat1[ ,na99][dat1[ ,na99] == 99] <- NA
dat1[ ,na999][dat1[ ,na999] == 999] <- NA
dat1[ ,na9999][dat1[ ,na9999] == 9999] <- NA

The above expression is quite succinct, but succinctness does not always translate into easy readability. There are two components to each of the last three lines of code: the first is a subset of columns for which a particular missing value is designated (e.g. dat1[ ,na99]); the second is an evaluation statement identifying which values in that column subset are flagged as missing (e.g. [dat1[ ,na99] == 99]). Together, these elements define the indices (table cells) whose values are replaced with NA (e.g. <- NA).

Another way this could have been accomplished (and one you might feel more comfortable with) would have been to work off of one column at a time, for example:

dat1$WSPD[dat1$WSPD == 99] <- NA
dat1$GST[dat1$GST   == 99] <- NA
dat1$WVHT[dat1$WVHT == 99] <- NA
dat1$DPD[dat1$DPD   == 99] <- NA
dat1$APD[dat1$APD   == 99] <- NA

dat1$WDIR[dat1$WDIR == 999]  <- NA
dat1$DEWP[dat1$DEWP == 999]  <- NA

dat1$PRES[dat1$PRES == 9999] <- NA

This chunk of code is easier to read, but lacks the succinctness of its predecessor.
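A middle ground between the two is a short loop over each group of columns. This version avoids the repetition while staying close to the one-column-at-a-time logic (a sketch, assuming dat1, na99 and na999 are defined as above; %in% is used instead of == so the loop also behaves safely if a column already contains NA values):

```r
for (col in na99)  dat1[[col]][dat1[[col]] %in% 99]  <- NA
for (col in na999) dat1[[col]][dat1[[col]] %in% 999] <- NA
dat1$PRES[dat1$PRES %in% 9999] <- NA
```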

Let’s look at a summary of our table:

summary(dat1)
       YY             MM               DD             hh              mm
Min.   :2011   Min.   : 1.000   Min.   : 1.0   Min.   : 0.00   Min.   :50
1st Qu.:2012   1st Qu.: 3.000   1st Qu.: 8.0   1st Qu.: 5.00   1st Qu.:50
Median :2012   Median : 6.000   Median :15.0   Median :11.00   Median :50
Mean   :2012   Mean   : 6.234   Mean   :15.4   Mean   :11.49   Mean   :50
3rd Qu.:2012   3rd Qu.: 9.000   3rd Qu.:23.0   3rd Qu.:17.50   3rd Qu.:50
Max.   :2012   Max.   :12.000   Max.   :31.0   Max.   :23.00   Max.   :50

WDIR            WSPD             GST              WVHT
Min.   :  1.0   Min.   : 0.000   Min.   : 0.000   Min.   :0.260
1st Qu.:136.0   1st Qu.: 3.800   1st Qu.: 4.700   1st Qu.:0.740
Median :208.0   Median : 6.000   Median : 7.200   Median :1.060
Mean   :198.2   Mean   : 6.315   Mean   : 7.684   Mean   :1.285
3rd Qu.:273.0   3rd Qu.: 8.500   3rd Qu.:10.200   3rd Qu.:1.570
Max.   :360.0   Max.   :18.800   Max.   :23.300   Max.   :8.650
NA's   :377     NA's   :377      NA's   :377      NA's   :29
DPD              APD              PRES             ATMP
Min.   : 2.600   Min.   : 3.120   Min.   : 981.3   Min.   :-11.00
1st Qu.: 5.560   1st Qu.: 4.410   1st Qu.:1009.2   1st Qu.:  5.60
Median : 7.140   Median : 5.000   Median :1014.9   Median : 10.40
Mean   : 7.664   Mean   : 5.225   Mean   :1014.6   Mean   : 10.78
3rd Qu.: 9.090   3rd Qu.: 5.820   3rd Qu.:1020.5   3rd Qu.: 16.80
Max.   :17.390   Max.   :10.430   Max.   :1040.2   Max.   : 25.10
NA's   :29       NA's   :29       NA's   :3
WTMP            DEWP
Min.   : 5.40   Min.   :-16.000
1st Qu.: 7.10   1st Qu.:  0.900
Median :11.60   Median :  8.600
Mean   :12.17   Mean   :  7.793
3rd Qu.:16.30   3rd Qu.: 15.300
Max.   :24.30   Max.   : 23.900
NA's   :4        

Note the NA’s in the summary output. We now have a clean dataframe to work with.
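As a final sanity check, we can count the NA values per column and tally the rows that remain fully complete (assuming dat1 has been cleaned as above):

```r
colSums(is.na(dat1))        # NA count for each column
sum(complete.cases(dat1))   # rows with no missing values at all
```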

# Is there an easier way?

Cleaning data tables is an important component of any analytical workflow, but the syntax used thus far can get complicated and be difficult to follow. Fortunately, there is an easier way to clean data tables using the dplyr package, which will be covered in Week 3.
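As a teaser, the entire cleanup might collapse into a single pipeline along these lines (a hedged preview, not the Week 3 material itself; it assumes dplyr version 1.0 or later for the across() helper):

```r
library(dplyr)

dat1 <- dat %>%
  select(-MWD, -VIS, -TIDE) %>%
  mutate(across(c(WSPD, GST, WVHT, DPD, APD), ~ na_if(.x, 99)),
         across(c(WDIR, DEWP),               ~ na_if(.x, 999)),
         PRES = na_if(PRES, 9999))
```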