The dataset used in this tutorial was downloaded from NOAA’s NDBC buoy data center in a space-delimited file format. It consists of hourly air and ocean physical measurements (such as temperature, wind speed and wave height) recorded in 2012 at NOAA’s buoy #44005, located in the Gulf of Maine (43.204 N, 69.128 W).

Type the following to load the data file into your current R session (note that we are using the read.table function instead of the read.csv function since the data are not comma-delimited):

dat <- read.table("http://mgimond.github.io/ES218/Data/buoy_44005_2012.dat", sep="", header=TRUE)

Exploring a dataframe

First, let’s extract the number of rows and columns in our table:

dim(dat)
[1] 8347   18

dat consists of 8347 rows and 18 columns.
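If you only need one of these two values, the base R functions nrow and ncol return them individually:

nrow(dat)
[1] 8347
ncol(dat)
[1] 18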

We can extract the names of each column using the names function:

names(dat)
 [1] "YY"   "MM"   "DD"   "hh"   "mm"   "WDIR" "WSPD" "GST"  "WVHT" "DPD"  "APD"  "MWD"  "PRES" "ATMP" "WTMP" "DEWP"
[17] "VIS"  "TIDE"

A brief description of each column follows:

Field  Description
YY     Year
MM     Month
DD     Day
hh     Hour
mm     Minute
WDIR   Wind direction (the direction the wind is coming from, in degrees clockwise from true N)
WSPD   Wind speed, averaged over an eight-minute period (m/s)
GST    Peak 5 or 8 second gust speed measured during the eight-minute or two-minute period (m/s)
WVHT   Significant wave height (meters)
DPD    Dominant wave period (seconds)
APD    Average wave period (seconds)
MWD    The direction from which the waves at the dominant period (DPD) are coming, in degrees clockwise from true North (North is 0 degrees, East is 90 degrees)
PRES   Sea level pressure (hPa)
ATMP   Air temperature (Celsius)
WTMP   Sea surface temperature (Celsius)
DEWP   Dewpoint temperature (Celsius)
VIS    Visibility (nautical miles)
TIDE   Water level above or below Mean Lower Low Water, MLLW (feet)

For large datasets, it’s sometimes helpful to display just the first few lines of the table to get a sense of the kind of data we are dealing with. We’ll use the head function to do this and have R return the first five records of the table (n=5).

head(dat, n=5)
    YY MM DD hh mm WDIR WSPD  GST WVHT  DPD  APD MWD   PRES ATMP WTMP DEWP VIS TIDE
1 2011 12 31 23 50  316  4.8  6.0 0.92 5.56 4.88 999 1008.1  6.8  8.1  6.8  99   99
2 2012  1  1  0 50  296  5.5  6.3 0.89 5.56 4.60 999 1007.8  6.1  8.1  6.1  99   99
3 2012  1  1  1 50  311  8.4 10.0 0.87 8.33 4.09 999 1008.2  5.5  8.1  5.5  99   99
4 2012  1  1  2 50  314  8.3 10.0 0.93 3.45 4.05 999 1008.0  5.4  8.1  5.4  99   99
5 2012  1  1  3 50  313 11.1 13.0 1.12 3.85 3.88 999 1008.0  5.8  8.1  5.8  99   99

We can also display the last few lines of the table using the tail function.

tail(dat, n=5)
       YY MM DD hh mm WDIR WSPD GST WVHT   DPD  APD MWD   PRES ATMP WTMP DEWP VIS TIDE
8343 2012 12 14  1 50  999   99  99 0.31  9.09 5.15 999 1026.3  5.6  9.7 -4.1  99   99
8344 2012 12 14  2 50  999   99  99 0.33  9.09 4.69 999 1025.5  5.6  9.7 -4.2  99   99
8345 2012 12 14  3 50  999   99  99 0.36 12.90 4.04 999 1025.0  5.6  9.7 -4.2  99   99
8346 2012 12 14  4 50  999   99  99 0.36 12.90 3.87 999 1024.6  5.6  9.7 -4.0  99   99
8347 2012 12 14  5 50  999   99  99 0.37  8.33 3.66 999 1023.8  5.6  9.6 -3.9  99   99

We can usually tell what data types are associated with each column by viewing the first few lines. But it’s best to check with the str function.

str(dat)
'data.frame':   8347 obs. of  18 variables:
 $ YY  : int  2011 2012 2012 2012 2012 2012 2012 2012 2012 2012 ...
 $ MM  : int  12 1 1 1 1 1 1 1 1 1 ...
 $ DD  : int  31 1 1 1 1 1 1 1 1 1 ...
 $ hh  : int  23 0 1 2 3 4 5 6 7 8 ...
 $ mm  : int  50 50 50 50 50 50 50 50 50 50 ...
 $ WDIR: int  316 296 311 314 313 319 319 318 332 322 ...
 $ WSPD: num  4.8 5.5 8.4 8.3 11.1 11.1 11.1 10.5 9.8 8.2 ...
 $ GST : num  6 6.3 10 10 13 13.3 13.2 12.5 11.5 9.6 ...
 $ WVHT: num  0.92 0.89 0.87 0.93 1.12 1.25 1.48 1.46 1.41 1.23 ...
 $ DPD : num  5.56 5.56 8.33 3.45 3.85 4.76 4.55 5 5 5 ...
 $ APD : num  4.88 4.6 4.09 4.05 3.88 4.06 4.31 4.39 4.31 4.15 ...
 $ MWD : int  999 999 999 999 999 999 999 999 999 999 ...
 $ PRES: num  1008 1008 1008 1008 1008 ...
 $ ATMP: num  6.8 6.1 5.5 5.4 5.8 5.8 6 6.5 6.8 6.7 ...
 $ WTMP: num  8.1 8.1 8.1 8.1 8.1 8 8 8.1 8.1 8.1 ...
 $ DEWP: num  6.8 6.1 5.5 5.4 5.8 5.7 5.8 5.5 5.1 5.2 ...
 $ VIS : num  99 99 99 99 99 99 99 99 99 99 ...
 $ TIDE: num  99 99 99 99 99 99 99 99 99 99 ...

All values are stored as numbers, with about a third stored as integers (int). The num designation indicates that the values are stored as double-precision floating point numbers.
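To check the storage mode of an individual column, you can use the typeof function (shown here for two of the columns):

typeof(dat$YY)
[1] "integer"
typeof(dat$WSPD)
[1] "double"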

Now let’s generate a summary of the table.

summary(dat)
       YY             MM               DD             hh              mm          WDIR            WSPD     
 Min.   :2011   Min.   : 1.000   Min.   : 1.0   Min.   : 0.00   Min.   :50   Min.   :  1.0   Min.   : 0.0  
 1st Qu.:2012   1st Qu.: 3.000   1st Qu.: 8.0   1st Qu.: 5.00   1st Qu.:50   1st Qu.:143.0   1st Qu.: 3.9  
 Median :2012   Median : 6.000   Median :15.0   Median :11.00   Median :50   Median :213.0   Median : 6.2  
 Mean   :2012   Mean   : 6.234   Mean   :15.4   Mean   :11.49   Mean   :50   Mean   :234.3   Mean   :10.5  
 3rd Qu.:2012   3rd Qu.: 9.000   3rd Qu.:23.0   3rd Qu.:17.50   3rd Qu.:50   3rd Qu.:285.0   3rd Qu.: 8.9  
 Max.   :2012   Max.   :12.000   Max.   :31.0   Max.   :23.00   Max.   :50   Max.   :999.0   Max.   :99.0  
      GST             WVHT             DPD              APD              MWD           PRES             ATMP       
 Min.   : 0.00   Min.   : 0.260   Min.   : 2.600   Min.   : 3.120   Min.   :999   Min.   : 981.3   Min.   :-11.00  
 1st Qu.: 4.80   1st Qu.: 0.740   1st Qu.: 5.560   1st Qu.: 4.410   1st Qu.:999   1st Qu.:1009.2   1st Qu.:  5.60  
 Median : 7.40   Median : 1.070   Median : 7.140   Median : 5.000   Median :999   Median :1014.9   Median : 10.40  
 Mean   :11.81   Mean   : 1.624   Mean   : 7.982   Mean   : 5.551   Mean   :999   Mean   :1017.9   Mean   : 10.78  
 3rd Qu.:10.70   3rd Qu.: 1.580   3rd Qu.: 9.090   3rd Qu.: 5.830   3rd Qu.:999   3rd Qu.:1020.5   3rd Qu.: 16.80  
 Max.   :99.00   Max.   :99.000   Max.   :99.000   Max.   :99.000   Max.   :999   Max.   :9999.0   Max.   : 25.10  
      WTMP            DEWP              VIS          TIDE   
 Min.   : 5.40   Min.   :-16.000   Min.   :99   Min.   :99  
 1st Qu.: 7.10   1st Qu.:  0.900   1st Qu.:99   1st Qu.:99  
 Median :11.60   Median :  8.600   Median :99   Median :99  
 Mean   :12.17   Mean   :  8.268   Mean   :99   Mean   :99  
 3rd Qu.:16.30   3rd Qu.: 15.300   3rd Qu.:99   3rd Qu.:99  
 Max.   :24.30   Max.   :999.000   Max.   :99   Max.   :99  

You’ll note that many columns in our data contain values of 99 or 999. Such values are often placeholders for missing data. If not dealt with properly, these values will be interpreted as valid data by the software. One tell-tale sign that these values should be treated specially is how extreme they are compared to the rest of the data batch. For example, the wave height field, WVHT, averages less than 2 meters, so a value of 99 meters should be flagged as suspicious. In fact, NOAA’s documentation states that “Missing data … are denoted by … a variable number of 9’s … depending on the data type (for example: 999.0 99.0).”

We also note that some columns contain nothing but missing values (e.g. MWD, VIS and TIDE), suggesting that either these variables are not measured at this particular buoy or the instruments did not function properly during the data collection period.

Before proceeding with any analysis, we need to address the missing values by flagging them as such or removing them altogether.

Removing columns with no data

Let’s first remove all columns devoid of valid data. These columns are MWD, VIS and TIDE. There are many ways to do this. We will opt for subsetting the table and assigning this subset to a new data frame we will call dat1.

dat1 <- dat[ , !(names(dat) %in% c("MWD", "VIS", "TIDE"))]

Let’s break down this expression into its many components. names(dat) returns a list of column names in the order in which they appear in the object dat: YY, MM, DD, hh, mm, WDIR, WSPD, GST, WVHT, DPD, APD, MWD, PRES, ATMP, WTMP, DEWP, VIS, TIDE.

The matching operator %in% compares two vectors and assesses whether each element on its left-hand side is included among the elements on its right-hand side. For each element in the left-hand vector, R returns TRUE if the value is present in the right-hand vector and FALSE if it is not. In our example, R evaluates whether each column name is included in the set of elements defined in c("MWD", "VIS", "TIDE"). The expression can be read as “… are any of the column names (i.e. YY, MM, DD, hh, mm, WDIR, WSPD, GST, WVHT, DPD, APD, MWD, PRES, ATMP, WTMP, DEWP, VIS, TIDE) included in the set ("MWD", "VIS", "TIDE")?” The output of this comparison is a series of TRUEs and FALSEs depending on whether or not a match is found.
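A toy example may help before applying this to our column names (the vectors below are made up for illustration):

c("a", "b", "c") %in% c("b", "d")
[1] FALSE  TRUE FALSE

Applying the same logic to our column names: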


names(dat) %in% c("MWD", "VIS", "TIDE")
 [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE  TRUE  TRUE

The order in which the TRUEs and FALSEs are listed matches the order of the column names. This order is important because it is the index R uses to identify which columns to return. But wait, R returns the columns associated with a TRUE value! We want the reverse, so we add the “not” operator, !, in front of the expression, thus flipping the TRUE/FALSE values.

!(names(dat) %in% c("MWD", "VIS", "TIDE"))
 [1]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE FALSE FALSE

So dat1 contains all the columns except MWD, VIS and TIDE.

head(dat1)
    YY MM DD hh mm WDIR WSPD  GST WVHT  DPD  APD   PRES ATMP WTMP DEWP
1 2011 12 31 23 50  316  4.8  6.0 0.92 5.56 4.88 1008.1  6.8  8.1  6.8
2 2012  1  1  0 50  296  5.5  6.3 0.89 5.56 4.60 1007.8  6.1  8.1  6.1
3 2012  1  1  1 50  311  8.4 10.0 0.87 8.33 4.09 1008.2  5.5  8.1  5.5
4 2012  1  1  2 50  314  8.3 10.0 0.93 3.45 4.05 1008.0  5.4  8.1  5.4
5 2012  1  1  3 50  313 11.1 13.0 1.12 3.85 3.88 1008.0  5.8  8.1  5.8
6 2012  1  1  4 50  319 11.1 13.3 1.25 4.76 4.06 1008.8  5.8  8.0  5.7
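As an aside, the same subset of columns can be obtained with base R’s subset function, whose select parameter accepts column names to drop when prefixed with a minus sign. This sketch should produce the same data frame:

dat1 <- subset(dat, select = -c(MWD, VIS, TIDE))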

Boolean Operations

The NOT operator, !, is one of three Boolean operators you’ll be making good use of in this course, the other two being the AND operator, &, and the OR operator, |.

The following table demonstrates a few examples using the vectors a <- 3 and b <- 6.

Boolean operator  Syntax  Example          Outcome
NOT               !       !(a == 3)        FALSE
AND               &       a == 3 & b == 1  FALSE
OR                |       a == 3 | b == 1  TRUE

The following table lists all possible Boolean outcomes:

Boolean operation  Outcome
T & T              TRUE
T & F              FALSE
F & F              FALSE
T | T              TRUE
T | F              TRUE
F | F              FALSE
!T                 FALSE
!F                 TRUE

Note that the operation a == 2 | 4 is not the same as a == 2 | a == 4! The former returns TRUE whereas the latter returns FALSE. This is because the Boolean operator evaluates both sides of its expression as separate logical outcomes (i.e. TRUE or FALSE values). The latter expression asks “is a equal to 2 OR is a equal to 4?” Since neither condition is true, the output is FALSE OR FALSE, which returns FALSE. In the former expression, the only condition being evaluated is a == 2; the value 4 just after the Boolean operator | is treated as a logical value, and any non-zero value is interpreted as TRUE by a Boolean operation. So a == 2 | 4 is really evaluated as a == 2 | TRUE. Always explicitly define each condition as a stand-alone expression, as in a == 2 | a == 4.
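You can verify this behavior yourself (here a is assigned the value 3, as in the earlier examples):

a <- 3
a == 2 | 4        # the 4 is treated as TRUE
[1] TRUE
a == 2 | a == 4   # FALSE | FALSE
[1] FALSE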

Assigning NA to missing data


It’s usually best to assign NA values the moment a data file is loaded into R. For example, had all missing values been flagged as -9999, we could have added the parameter na.strings="-9999" to the read.table() function. But our dataset flags missing values as 99, 999 and 9999, which would require combining the missing strings as na.strings = c("99", "999", "9999"). This would result in undesirable outcomes: a wave height of 99 meters clearly makes no sense, so flagging such a value as NA is sensible, but a wind direction of 99 degrees is perfectly valid, and flagging it as NA would not be appropriate. Had we set all values of 99 to NA the moment we loaded the file into R, we would have inadvertently removed valid values from the dataset. So flagging missing values as NA in our dataset requires column-level intervention.
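For illustration only, the global approach (which we are not adopting, for the reasons just given) would have looked something like this:

dat <- read.table("http://mgimond.github.io/ES218/Data/buoy_44005_2012.dat", 
                  sep="", header=TRUE, na.strings=c("99", "999", "9999"))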

Let’s review the data summary to identify each column’s no-data numeric designation (i.e. 99, 999 or 9999), if present. Remember that we are now working off of the dat1 dataframe and not the original dat.

summary(dat1)
       YY             MM               DD             hh              mm          WDIR            WSPD     
 Min.   :2011   Min.   : 1.000   Min.   : 1.0   Min.   : 0.00   Min.   :50   Min.   :  1.0   Min.   : 0.0  
 1st Qu.:2012   1st Qu.: 3.000   1st Qu.: 8.0   1st Qu.: 5.00   1st Qu.:50   1st Qu.:143.0   1st Qu.: 3.9  
 Median :2012   Median : 6.000   Median :15.0   Median :11.00   Median :50   Median :213.0   Median : 6.2  
 Mean   :2012   Mean   : 6.234   Mean   :15.4   Mean   :11.49   Mean   :50   Mean   :234.3   Mean   :10.5  
 3rd Qu.:2012   3rd Qu.: 9.000   3rd Qu.:23.0   3rd Qu.:17.50   3rd Qu.:50   3rd Qu.:285.0   3rd Qu.: 8.9  
 Max.   :2012   Max.   :12.000   Max.   :31.0   Max.   :23.00   Max.   :50   Max.   :999.0   Max.   :99.0  
      GST             WVHT             DPD              APD              PRES             ATMP             WTMP      
 Min.   : 0.00   Min.   : 0.260   Min.   : 2.600   Min.   : 3.120   Min.   : 981.3   Min.   :-11.00   Min.   : 5.40  
 1st Qu.: 4.80   1st Qu.: 0.740   1st Qu.: 5.560   1st Qu.: 4.410   1st Qu.:1009.2   1st Qu.:  5.60   1st Qu.: 7.10  
 Median : 7.40   Median : 1.070   Median : 7.140   Median : 5.000   Median :1014.9   Median : 10.40   Median :11.60  
 Mean   :11.81   Mean   : 1.624   Mean   : 7.982   Mean   : 5.551   Mean   :1017.9   Mean   : 10.78   Mean   :12.17  
 3rd Qu.:10.70   3rd Qu.: 1.580   3rd Qu.: 9.090   3rd Qu.: 5.830   3rd Qu.:1020.5   3rd Qu.: 16.80   3rd Qu.:16.30  
 Max.   :99.00   Max.   :99.000   Max.   :99.000   Max.   :99.000   Max.   :9999.0   Max.   : 25.10   Max.   :24.30  
      DEWP        
 Min.   :-16.000  
 1st Qu.:  0.900  
 Median :  8.600  
 Mean   :  8.268  
 3rd Qu.: 15.300  
 Max.   :999.000  

It appears that the fields with missing values flagged as 99 are WSPD, GST, WVHT, DPD and APD; the fields with missing values flagged as 999 are WDIR and DEWP (we will also include ATMP and WTMP in this group since NOAA uses 999 to flag missing temperatures, even though no such values appear in this dataset); and the field with missing values flagged as 9999 is PRES. We will convert these flags to NA values in the following chunk of code.

# Create list of columns with missing values
na99   <- c("WSPD", "GST", "WVHT", "DPD", "APD")
na999  <- c("WDIR", "DEWP", "ATMP", "WTMP")
na9999 <- c("PRES")

# Replace missing values with NA
dat1[, na99][dat1[, na99] == 99] <- NA
dat1[, na999][dat1[, na999] == 999] <- NA
dat1[, na9999][dat1[, na9999] == 9999] <- NA

There are two components to each of the last three lines of code: the first is a subset of the columns sharing a particular missing-value flag (e.g. dat1[, na99]); the second is an evaluation statement identifying which values in that column subset are flagged as missing (e.g. [dat1[, na99] == 99]). Together, these elements define the indices (table cells) whose values are replaced with NA (e.g. <- NA).
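A toy example (with made-up values) may make this indexing pattern clearer:

toy <- data.frame(a = c(1, 99, 3), b = c(99, 2, 99))
toy[, c("a", "b")][toy[, c("a", "b")] == 99] <- NA  # replace only the flagged cells
toy
   a  b
1  1 NA
2 NA  2
3  3 NA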

Another way this could have been accomplished (and one you might feel more comfortable with) would have been to work off of one column at a time, for example:

dat1$WSPD[dat1$WSPD == 99]   <- NA
dat1$GST[dat1$GST == 99]     <- NA
...
dat1$WDIR[dat1$WDIR == 999]  <- NA
...
dat1$PRES[dat1$PRES == 9999] <- NA

Let’s look at a summary of our table:

summary(dat1)
       YY             MM               DD             hh              mm          WDIR            WSPD       
 Min.   :2011   Min.   : 1.000   Min.   : 1.0   Min.   : 0.00   Min.   :50   Min.   :  1.0   Min.   : 0.000  
 1st Qu.:2012   1st Qu.: 3.000   1st Qu.: 8.0   1st Qu.: 5.00   1st Qu.:50   1st Qu.:136.0   1st Qu.: 3.800  
 Median :2012   Median : 6.000   Median :15.0   Median :11.00   Median :50   Median :208.0   Median : 6.000  
 Mean   :2012   Mean   : 6.234   Mean   :15.4   Mean   :11.49   Mean   :50   Mean   :198.2   Mean   : 6.315  
 3rd Qu.:2012   3rd Qu.: 9.000   3rd Qu.:23.0   3rd Qu.:17.50   3rd Qu.:50   3rd Qu.:273.0   3rd Qu.: 8.500  
 Max.   :2012   Max.   :12.000   Max.   :31.0   Max.   :23.00   Max.   :50   Max.   :360.0   Max.   :18.800  
                                                                             NA's   :377     NA's   :377     
      GST              WVHT            DPD              APD              PRES             ATMP             WTMP      
 Min.   : 0.000   Min.   :0.260   Min.   : 2.600   Min.   : 3.120   Min.   : 981.3   Min.   :-11.00   Min.   : 5.40  
 1st Qu.: 4.700   1st Qu.:0.740   1st Qu.: 5.560   1st Qu.: 4.410   1st Qu.:1009.2   1st Qu.:  5.60   1st Qu.: 7.10  
 Median : 7.200   Median :1.060   Median : 7.140   Median : 5.000   Median :1014.9   Median : 10.40   Median :11.60  
 Mean   : 7.684   Mean   :1.285   Mean   : 7.664   Mean   : 5.225   Mean   :1014.6   Mean   : 10.78   Mean   :12.17  
 3rd Qu.:10.200   3rd Qu.:1.570   3rd Qu.: 9.090   3rd Qu.: 5.820   3rd Qu.:1020.5   3rd Qu.: 16.80   3rd Qu.:16.30  
 Max.   :23.300   Max.   :8.650   Max.   :17.390   Max.   :10.430   Max.   :1040.2   Max.   : 25.10   Max.   :24.30  
 NA's   :377      NA's   :29      NA's   :29       NA's   :29       NA's   :3                                        
      DEWP        
 Min.   :-16.000  
 1st Qu.:  0.900  
 Median :  8.600  
 Mean   :  7.793  
 3rd Qu.: 15.300  
 Max.   : 23.900  
 NA's   :4        

Note the NA’s in the summary output. We now have a clean dataframe to work with.
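As a quick check, you can also tally the NA values in each column by combining the colSums and is.na functions:

colSums(is.na(dat1))
  YY   MM   DD   hh   mm WDIR WSPD  GST WVHT  DPD  APD PRES ATMP WTMP DEWP 
   0    0    0    0    0  377  377  377   29   29   29    3    0    0    4 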

Removing rows with NA values

Your analysis might require that all rows in a table be complete cases (i.e. contain no missing values). You can use the na.omit function to return just those rows. We will create a synthetic example to demonstrate the na.omit() function.

d <- data.frame(x = c(1,4,2,5,2,3,NA), 
                y = c(3,2,5,3,8,1,1), 
                z = c(NA,NA,4,9,7,8,3))
d
   x y  z
1  1 3 NA
2  4 2 NA
3  2 5  4
4  5 3  9
5  2 8  7
6  3 1  8
7 NA 1  3

Note the NA values in the dataframe. Now let’s create a new dataframe that we’ll name d.complete that only stores the rows for which we have complete cases (i.e. no NA values).

d.complete <- na.omit(d)
d.complete
  x y z
3 2 5 4
4 5 3 9
5 2 8 7
6 3 1 8

All rows with at least one NA value have been removed. Note that the table’s dimension has changed from seven rows to four.
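An equivalent approach uses the complete.cases function, which returns a logical vector identifying rows with no missing values; subsetting with it yields the same result as na.omit:

complete.cases(d)
[1] FALSE FALSE  TRUE  TRUE  TRUE  TRUE FALSE
d[complete.cases(d), ]
  x y z
3 2 5 4
4 5 3 9
5 2 8 7
6 3 1 8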

Checking if a value is NA

When assessing whether a value is equal to NA, the following evaluation may behave unexpectedly.

a <- c(3, 67, 4, NA, 10)
a == NA
[1] NA NA NA NA NA

The output is not the logical data type we would expect from an evaluation; any comparison with NA returns NA. Instead, you must make use of the is.na() function:

is.na(a)
[1] FALSE FALSE FALSE  TRUE FALSE

As another example, if we want to keep all rows in dataframe d where z is NA, we would type:

d[ is.na(d$z), ]
  x y  z
1 1 3 NA
2 4 2 NA

You can, of course, use the ! operator to reverse the evaluation and omit all rows where z is NA:

d[ !is.na(d$z), ]
   x y z
3  2 5 4
4  5 3 9
5  2 8 7
6  3 1 8
7 NA 1 3

Is there an easier way?

Cleaning data tables is an important component of any analytical workflow, but the syntax can get complicated and difficult to follow. Fortunately, there is an easier way to clean data tables using the dplyr package which will be covered in Week 3.