Data types

R objects can store values as different data types including numeric (both integer and double), character, logical or date. R variable types are often referred to as modes and can be identified using the mode() or typeof() functions.

Numeric

The numeric data type is probably the simplest. It consists of numbers such as integers (e.g. 1, -3, 33, 0) or doubles (e.g. 0.3, 12.4, -0.04, 1.0). For example, to create a numeric (double) vector we can type:

x <- c(1.0, -3.4, 2, 140.1)
mode(x)
[1] "numeric"

To assess if the number is stored as an integer or a double use the typeof() function.

typeof(x)
[1] "double"

Note that omitting the fractional part of a number when creating a numeric object does not create an integer. For example, creating what seems to be an integer object returns double when queried by typeof():

x <- 4
typeof(x)
[1] "double"

To force R to recognize a value as an integer add an upper case L to the number.

x <- 4L
typeof(x)
[1] "integer"
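A few related checks can be sketched as follows. This is only an illustration; the object names a, b and s are not used elsewhere in this tutorial.

```r
a <- 4L            # integer literal
b <- 4             # double literal

is.integer(a)      # TRUE
is.integer(b)      # FALSE
a == b             # TRUE: the two are compared by value
identical(a, b)    # FALSE: same value, different storage type

s <- 1:3           # the : operator returns integers
typeof(s)          # "integer"

typeof(a + 1)      # "double": mixing in a double promotes the result
typeof(a + 1L)     # "integer"
as.integer(4.7)    # 4: the fractional part is dropped, not rounded
```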

Character

The character data type consists of letters or words such as "a", "f", "project", "house value".

x <- c("a", "f", "project", "house value")
mode(x)
[1] "character"

Characters can also consist of numbers represented as characters. The distinction between a character representation of a number and a numeric one is important. For example, if we have two numeric vectors x and y such as

x <- 3
y <- 5.3

and we choose to sum the two variables, we get:

x + y
[1] 8.3

If we repeat these steps but instead choose to represent the numbers 3 and 5.3 as characters we get the following error message:

x <- "3"
y <- "5.3"
x + y
Error in x + y: non-numeric argument to binary operator

Note the use of quotes to force numbers to character mode.
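If the quoted values really are meant to be numbers, one way around the error is to convert them before adding. The sketch below assumes the strings hold valid numbers; the name total is only illustrative.

```r
x <- "3"
y <- "5.3"

is.character(x)                          # TRUE
total <- as.numeric(x) + as.numeric(y)   # convert first, then add
total                                    # 8.3

paste(x, y)                              # "3 5.3": characters are joined, not summed
```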

Logical

Logical values can take on one of two values: TRUE or FALSE. These can also be represented as 1 or 0. For example, to create a logical vector of 4 elements, you can type

x <- c(TRUE, FALSE, FALSE, TRUE)

or

x <- as.logical(c(1,0,0,1))

Note that in both cases, mode(x) returns logical. Also note that the 1’s and 0’s in the last example are converted to TRUE’s and FALSE’s.
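Because TRUE and FALSE behave like 1 and 0 in arithmetic, logical vectors are handy for counting. A minimal sketch (the vectors v and big are made up for illustration):

```r
x <- c(TRUE, FALSE, FALSE, TRUE)

mode(x)         # "logical"
as.numeric(x)   # 1 0 0 1
sum(x)          # 2: the number of TRUE elements
mean(x)         # 0.5: the proportion of TRUE elements

v   <- c(3, 8, 1, 9)
big <- v > 5    # a comparison returns a logical vector: FALSE TRUE FALSE TRUE
sum(big)        # 2 values are greater than 5
```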

Factor

Factors are normally used to group variables into a fixed number of unique categories or levels. For example, a dataset may be grouped by gender or month of the year. Such data are usually loaded into R as a numeric or character data type requiring that they be converted to a factor using the as.factor() function.

In the following example, a dataframe (dataframes are data tables; they will be covered later in this tutorial) is created with a column, gender, storing one of three values: M, F and U. These values are first encoded as character objects, but are then converted to factors using the as.factor() function.

a      <- c("M", "F", "F", "U", "F", "M", "M", "M", "F", "U")
a.fact <- as.factor(a)
x      <- c(166, 47, 61, 148, 62, 123, 232, 98, 93, 110)
dat    <- data.frame(x = x, gender = a.fact)
dat
     x gender
1  166      M
2   47      F
3   61      F
4  148      U
5   62      F
6  123      M
7  232      M
8   98      M
9   93      F
10 110      U

The gender column is now a factor with three levels: F, M and U.

str(dat)
'data.frame':   10 obs. of  2 variables:
 $ x     : num  166 47 61 148 62 123 232 98 93 110
 $ gender: Factor w/ 3 levels "F","M","U": 2 1 1 3 1 2 2 2 1 3

You can also use the levels function to identify the unique levels in a factor:

levels(a.fact)
[1] "F" "M" "U"

Many functions recognize factor data types and will allow you to split the output into groups defined by the factor’s unique levels. For example, to create three boxplots of the value x, one for each gender group F, M and U, type the following:

boxplot(x ~ gender, dat, horizontal = TRUE)

The tilde ~ operator is used in the plot function to split (or condition) the data into separate plots based on the factor gender.
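Plotting functions are not the only ones that group by factor levels. As a sketch, the base functions table() and tapply() can summarize the same data by gender; the names counts and med are illustrative only.

```r
a   <- c("M", "F", "F", "U", "F", "M", "M", "M", "F", "U")
x   <- c(166, 47, 61, 148, 62, 123, 232, 98, 93, 110)
dat <- data.frame(x = x, gender = as.factor(a))

# Number of records in each level
counts <- table(dat$gender)
counts   # F: 4, M: 4, U: 2

# Median of x computed separately for each level
med <- tapply(dat$x, dat$gender, median)
med      # F: 61.5, M: 144.5, U: 129
```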

Rearranging level order

A factor will define a hierarchy for its levels. When we invoked the levels function in the last example, you may have noted that the levels were listed in the order F, M and U. This is the level hierarchy defined for gender (i.e. F first, then M, then U). This means that regardless of the order in which the values appear in a table, anytime a plot or operation is conditioned by the factor, the grouped elements will appear in the order defined by the levels’ hierarchy. When we created the boxplot from our dat object, the plotting function ordered the boxplots (bottom to top) following gender’s level hierarchy (i.e. F first, then M, then U).

If we wanted the boxplots to be plotted in a different order (i.e. U first followed by F then M) we would need to modify the gender column by recreating the factor object as follows:

dat$gender <- factor(dat$gender, levels=c("U","F","M"))
str(dat)
'data.frame':   10 obs. of  2 variables:
 $ x     : num  166 47 61 148 62 123 232 98 93 110
 $ gender: Factor w/ 3 levels "U","F","M": 3 2 2 1 2 3 3 3 2 1

The factor function is given the original factor values but is also given the levels in the new order in which they are to appear. Now, if we recreate the boxplot, the plot order (plotted from bottom to top) will reflect the new level hierarchy.

boxplot(x ~ gender, dat, horizontal = TRUE)
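To see what reordering the levels changes, and what it leaves alone, consider the small sketch below. The factors g and g2 are hypothetical and separate from the dat example.

```r
g <- factor(c("M", "F", "U", "F"))
levels(g)          # "F" "M" "U"
as.integer(g)      # 2 1 3 1: each value is stored as a pointer to a level

g2 <- factor(g, levels = c("U", "F", "M"))
levels(g2)         # "U" "F" "M"
as.integer(g2)     # 3 2 1 2: the pointers change with the new level order
as.character(g2)   # "M" "F" "U" "F": the values themselves are unchanged

is.ordered(g2)     # FALSE: the level order drives display, not < or > comparisons
```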

Subsetting table by level

We can subset the table by level using the subset function. For example, to subset the values associated with F and M, type:

dat.f <- subset(dat, gender == "F" | gender == "M")
dat.f
    x gender
1 166      M
2  47      F
3  61      F
5  62      F
6 123      M
7 232      M
8  98      M
9  93      F

The double equality sign == differs from the single equality sign = in that the former assesses a condition: it checks if the variable to the left of == equals the variable to the right.

However, if you display the levels associated with this new dataframe, you’ll see that level U is still listed even though it no longer appears in the gender column.

levels(dat.f$gender)
[1] "U" "F" "M"

This can be a nuisance when plotting the data subset.

boxplot(x ~ gender, dat.f, horizontal = TRUE)

Even though no records are available for U, the plot function allocates a slot for that level. To resolve this, we can use the droplevels function to remove all unused levels.

dat.f$gender <- droplevels(dat.f$gender)
levels(dat.f$gender)
[1] "F" "M"
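droplevels can also be applied to a whole data frame, in which case unused levels are dropped from every factor column. A minimal sketch using a smaller, made-up table (the names small, sub and sub2 are illustrative):

```r
small <- data.frame(x      = c(166, 47, 148, 62),
                    gender = factor(c("M", "F", "U", "F")))

sub <- subset(small, gender != "U")
table(sub$gender)      # F: 2, M: 1, U: 0 (the empty level is still counted)

sub2 <- droplevels(sub)
levels(sub2$gender)    # "F" "M"
table(sub2$gender)     # F: 2, M: 1
```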

Date

Date values can be represented as numbers or characters. But to be properly interpreted as a date object in R, values must be stored as an R date object. R provides many facilities to convert and manipulate dates and times, but a package called lubridate makes working with dates/times much easier. A separate section is dedicated to the creation and manipulation of date objects.
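As a brief preview using only base R (not lubridate), the sketch below converts character strings to Date objects with as.Date(). The dates and object names are made up for illustration.

```r
d1 <- as.Date("2000-09-16")
d2 <- as.Date("2000-09-01")

class(d1)             # "Date"
typeof(d1)            # "double": dates are stored internally as numbers

gap <- d1 - d2
gap                   # Time difference of 15 days
as.numeric(gap)       # 15

format(d1, "%Y")      # "2000": extract the year as a character string
d2 + 30               # "2000-10-01": adding a number adds days

# A non-default layout needs an explicit format string
d3 <- as.Date("16/09/2000", format = "%d/%m/%Y")
d3 == d1              # TRUE
```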

NA and NULL

You will find that many data files contain missing or unknown values. It may be tempting to assign these missing or unknown values a 0 but doing so can lead to many undesirable results when analyzing the data. R has two placeholders for such elements: NA and NULL.

For example, let’s say that we made four measurements where the second measurement was not available but we wanted that missing value to be recorded in our table, we would encode that missing value as follows:

x <- c(23, NA, 1.2, 5)

NA (Not Available) is a missing value indicator. It suggests that a value should be present but is unknown.

The NULL object also represents missing values but its interpretation is slightly different in that it suggests that the value does not exist or that it’s not measurable.

y <- c(23, NULL, 1.2, 5)

The difference between NA and NULL may seem subtle, but their interpretation in some functions can lead to different outcomes. For example, when computing the mean of x, R returns an NA value:

mean(x)
[1] NA

This serves as a check to remind the user that one of the elements is missing. This can be overcome by setting the na.rm parameter to TRUE, as in mean(x, na.rm = TRUE), in which case R ignores the missing value.

A NULL object, on the other hand, is treated differently. Since NULL implies that a value should not be present, the c() function drops it when building the vector: y holds only three elements (23, 1.2 and 5). R therefore has no questionable element to flag and computes the mean of the remaining values:

mean(y)
[1] 9.733333

It’s more common to find data tables with missing elements populated with NA’s than NULL’s so unless you have a specific reason to use NULL as a missing value placeholder, use NA instead.
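The difference is easy to verify by inspecting the two vectors. The following sketch reuses x and y from above:

```r
x <- c(23, NA, 1.2, 5)
y <- c(23, NULL, 1.2, 5)

length(x)               # 4: NA occupies a slot in the vector
length(y)               # 3: NULL was dropped by c()

is.na(x)                # FALSE TRUE FALSE FALSE
sum(is.na(x))           # 1 missing value

mean(x, na.rm = TRUE)   # 9.733333, the same as mean(y)
is.null(NULL)           # TRUE
```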

Data structures

A data type defines a single element (e.g. a single number or a single date), but most datasets we work with consist of many elements such as a table of temperature values or a list of survey results. These datasets are stored in R in one of several data structures such as (atomic) vectors, matrices, data frames and lists.

(Atomic) Vectors

The atomic vector (or vector for short) is the simplest data structure in R. It consists of an ordered set of values of the same type (e.g. numeric, character, date, etc.). A vector can be created using the combine function c() as in

x <- c(674 , 4186 , 5308 , 5083 , 6140 , 6381)
x
[1]  674 4186 5308 5083 6140 6381

A vector object is an indexable collection of values which allows one to access a specific index number. For example, to access the third element of x, type

x[3]
[1] 5308

You can also select a subset of elements by index values using the combine function c(),

x[c(1,3,4)]
[1]  674 5308 5083

Or, if you are interested in a range of indexed values, use the : operator,

x[2:4]
[1] 4186 5308 5083

You can also assign new values to a specific index. For example, we can replace the second value in vector x with 0,

x[2] <- 0
x
[1]  674    0 5308 5083 6140 6381

Note that a vector can store any data type such as characters or strings,

x <- c("all", "b", "olive")
x
[1] "all"   "b"     "olive"

However, a vector can only be of one type. For example, you cannot mix numeric and character types as follows:

x <- c( 1.2, 5, "Rt", "9-16-2000")

In such a situation, R will convert the element types to the highest common mode following the order NULL < logical < integer < double < character. In our working example, the elements are coerced to character:

mode(x)
[1] "character"
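The coercion order can be checked one step at a time. In this sketch each call to c() mixes two types and typeof() reports the type that wins:

```r
typeof(c(TRUE, 2L))     # "integer":   logical is promoted to integer
typeof(c(2L, 3.5))      # "double":    integer is promoted to double
typeof(c(3.5, "a"))     # "character": double is promoted to character

c(TRUE, 2L)             # 1 2
c(1.2, TRUE, "Rt")      # "1.2" "TRUE" "Rt"

length(c(1, NULL, 2))   # 2: NULL contributes no element at all
```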

Matrices and arrays

Matrices in R can be thought of as vectors indexed using two indices instead of one. For example, the following line of code creates a 3 by 3 matrix of randomly generated values. The parameters nrow and ncol define the number of rows and columns of the matrix, and the function runif() generates the nine random numbers (here between 0 and 10) that populate it.

m <- matrix(runif(9,0,10), nrow = 3, ncol = 3)
m
          [,1]     [,2]      [,3]
[1,] 3.2619298 5.087999 1.4707656
[2,] 7.6767365 3.855925 9.0455366
[3,] 0.7962403 5.457768 0.7772612

If a higher dimension vector is desired, then use the array() function to generate the n-dimensional object. A 3x3x3 array can be created as follows:

m <- array(runif(27,0,10), c(3,3,3))
m
, , 1

           [,1]      [,2]      [,3]
[1,] 4.96707652 6.1121388 2.4113841
[2,] 0.07456467 0.2300082 0.8561746
[3,] 0.43415909 8.0016413 4.2947824

, , 2

          [,1]     [,2]     [,3]
[1,] 2.4664150 1.918241 7.188041
[2,] 8.7000401 5.673075 3.452806
[3,] 0.8046015 9.799181 1.755217

, , 3

          [,1]       [,2]      [,3]
[1,] 1.4170178 0.49012164 5.4418602
[2,] 1.1843084 0.02363908 0.1182934
[3,] 0.4777543 8.30753084 2.9205772

Matrices and arrays can store numeric or character data types, but they cannot store both. This is not to say that you can’t have a matrix of the kind

     [,1] [,2]
[1,] "a"  "2" 
[2,] "b"  "4" 

but the values 2 and 4 are no longer treated as numeric values but as character values instead.
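The sketch below shows how matrix elements are indexed and how a mixed-type matrix like the one above comes about. It uses fixed values rather than runif() so that the results are reproducible; the names m and mc are illustrative.

```r
m <- matrix(1:6, nrow = 2, ncol = 3)   # values fill the matrix column by column
dim(m)       # 2 3
m[2, 3]      # 6: row 2, column 3
m[1, ]       # 1 3 5: the entire first row
m[, 2]       # 3 4: the entire second column

# Mixing characters and numbers coerces every element to character
mc <- matrix(c("a", "b", 2, 4), nrow = 2)
typeof(mc)            # "character"
mc[1, 2]              # "2"
as.numeric(mc[, 2])   # 2 4: convert the column back to numbers when needed
```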

Data frames

A data frame is what comes closest to our perception of a data table. It’s an extension of the matrix object in that, unlike a matrix, a data frame can mix data types across columns (e.g. both numeric and character columns can coexist in a data frame), but the data type remains the same across all rows of a given column.

Name   <- c("a1", "a2", "b3")
Value1 <- c(23, 4, 12)
Value2 <- c(1,45,5)
dat    <- data.frame(Name, Value1, Value2)
dat
  Name Value1 Value2
1   a1     23      1
2   a2      4     45
3   b3     12      5

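To view each column’s data type, use the str() function. Note that the output shown here comes from a version of R earlier than 4.0.0, where data.frame() converts character columns to factors by default; in R 4.0.0 and later, the Name column is reported as chr instead.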

str(dat)
'data.frame':   3 obs. of  3 variables:
 $ Name  : Factor w/ 3 levels "a1","a2","b3": 1 2 3
 $ Value1: num  23 4 12
 $ Value2: num  1 45 5

Like a vector, elements of a data frame can be accessed by their index (aka subscripts). The first index represents the row number and the second index represents the column number. For example, to list the second row of the third column, type:

dat[2, 3]
[1] 45

If you wish to list all rows for columns one through two leave the first index blank:

dat[ , 1:2 ]
  Name Value1
1   a1     23
2   a2      4
3   b3     12

Or, to list the third row for all columns, leave the second index blank:

dat[ 3 , ]
  Name Value1 Value2
3   b3     12      5

You can also reference columns by their names if you append the $ character to the dataframe object name. For example, to list the values in the column named Value2, type:

dat$Value2
[1]  1 45  5
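A few other common ways of accessing data frame elements are sketched below. The Total column and the name big.rows are additions made for illustration only.

```r
Name   <- c("a1", "a2", "b3")
Value1 <- c(23, 4, 12)
Value2 <- c(1, 45, 5)
dat    <- data.frame(Name, Value1, Value2)

dat[["Value2"]]       # 1 45 5: same result as dat$Value2
dat[2, "Value2"]      # 45: row index combined with a column name

# Rows can be selected with a logical condition
big.rows <- dat[dat$Value1 > 10, ]
big.rows$Name         # rows a1 and b3 are kept

# New columns are created by assigning to a new name
dat$Total <- dat$Value1 + dat$Value2
dat$Total             # 24 49 17
dim(dat)              # 3 4
```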

Lists

A list is an ordered set of components stored in a 1D structure. In fact, it’s another kind of vector called a recursive vector where each element can be of a different data type and structure. This implies that each element of a list can hold complex objects such as matrices, data frames and other list objects too! Think of a list as a single-column spreadsheet where each cell stores anything from a number to three paragraphs of text to a five-column table.

A list is constructed using the list() function. For example, the following list consists of 3 components: a two-column data frame (tagged A), a two element logical vector (tagged B) and a three element character vector (tagged D).

A <- data.frame(
     x = c(7.3, 29.4, 29.4, 2.9, 12.3, 7.5, 36.0, 4.8, 18.8, 4.2),
     y = c(5.2, 26.6, 31.2, 2.2, 13.8, 7.8, 35.2, 8.6, 20.3, 1.1)
     )
B <- c(TRUE, FALSE)
D <- c("apples", "oranges", "round")

lst <- list(A = A, B = B, D = D)

You can view each component’s structure using the str() function.

str(lst)
List of 3
 $ A:'data.frame':  10 obs. of  2 variables:
  ..$ x: num [1:10] 7.3 29.4 29.4 2.9 12.3 7.5 36 4.8 18.8 4.2
  ..$ y: num [1:10] 5.2 26.6 31.2 2.2 13.8 7.8 35.2 8.6 20.3 1.1
 $ B: logi [1:2] TRUE FALSE
 $ D: chr [1:3] "apples" "oranges" "round"

Each component of a list can be extracted using the $ symbol followed by that component’s tag (the name that was assigned to each component when we constructed the list object). For example, to access component A from list lst, type:

lst$A
      x    y
1   7.3  5.2
2  29.4 26.6
3  29.4 31.2
4   2.9  2.2
5  12.3 13.8
6   7.5  7.8
7  36.0 35.2
8   4.8  8.6
9  18.8 20.3
10  4.2  1.1

You can also access that same component using its numerical index. Since A is the first component in lst, its numerical index is 1.

lst[[1]]
      x    y
1   7.3  5.2
2  29.4 26.6
3  29.4 31.2
4   2.9  2.2
5  12.3 13.8
6   7.5  7.8
7  36.0 35.2
8   4.8  8.6
9  18.8 20.3
10  4.2  1.1

Note that we are using double brackets to extract A. In doing so, we are extracting A in its native data format (a data frame in our example). Had we used single brackets, A would have been extracted as a list regardless of its native format. The following compares the different data structure outputs between single and double bracketed indices:

class(lst[[1]])
[1] "data.frame"
class(lst[1])
[1] "list"
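The same distinction applies when components are referenced by tag name instead of by number. A small sketch, using a cut-down version of the list (the data frame here is a made-up stand-in for A):

```r
lst <- list(A = data.frame(x = 1:3),
            B = c(TRUE, FALSE),
            D = c("apples", "oranges", "round"))

class(lst["B"])             # "list": single brackets return a smaller list
class(lst[["B"]])           # "logical": double brackets return the component itself

# Single brackets can return several components at once
length(lst[c("B", "D")])    # 2

# Double brackets can be followed by an index into the extracted component
lst[["D"]][2]               # "oranges"
```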

The tag names for each component in a list can be extracted using the names() function:

names(lst)
[1] "A" "B" "D"

Note that lists do not require tag names for their components. For example, we could have created a list as follows (note the omission of A=, B=, etc…):

lst.notags <- list(A, B, D)

Listing its contents displays bracketed indices instead of tag names.

lst.notags
[[1]]
      x    y
1   7.3  5.2
2  29.4 26.6
3  29.4 31.2
4   2.9  2.2
5  12.3 13.8
6   7.5  7.8
7  36.0 35.2
8   4.8  8.6
9  18.8 20.3
10  4.2  1.1

[[2]]
[1]  TRUE FALSE

[[3]]
[1] "apples"  "oranges" "round"  

When lists do not have tags, the names() function will return NULL.

names(lst.notags)
NULL

It’s usually good practice to assign tags to list components as these can provide meaningful descriptions of each element.

You’ll find that many functions in R return list objects such as the linear regression model function lm. For example, run a regression analysis for vector elements x and y (in data frame A) and save the output of the regression analysis to an object called M:

M <- lm( y ~ x, A)

Now let’s verify M’s data structure:

typeof(M)
[1] "list"

The contents of the regression model object M consist of 12 components which include various diagnostic statistics, regression coefficients, and residuals. Let’s extract each component’s tag (i.e. element name):

names(M)
 [1] "coefficients"  "residuals"     "effects"       "rank"          "fitted.values" "assign"        "qr"           
 [8] "df.residual"   "xlevels"       "call"          "terms"         "model"        

Fortunately, the regression function assigns descriptive tag names to each of its components making it easier to figure out what most of these components represent. For example, it’s clear that the residuals tag stores the residual values from the regression model.

M$residuals
         1          2          3          4          5          6          7          8          9         10 
-2.1119291 -2.6122265  1.9877735 -0.7516888  1.5332525  0.2898782 -0.5525869  3.7654802  1.5919885 -3.1399416 

The list M is more complex than the simple list lst we created earlier. In addition to having more elements, it stores a wider range of data types and structures. For example, element qr is itself a list of five elements!

str(M$qr)
List of 5
 $ qr   : num [1:10, 1:2] -3.162 0.316 0.316 0.316 0.316 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:10] "1" "2" "3" "4" ...
  .. ..$ : chr [1:2] "(Intercept)" "x"
  ..- attr(*, "assign")= int [1:2] 0 1
 $ qraux: num [1:2] 1.32 1.44
 $ pivot: int [1:2] 1 2
 $ tol  : num 0.0000001
 $ rank : int 2
 - attr(*, "class")= chr "qr"

So if we want to access the element rank in the component qr of list M, we can type:

M$qr$rank
[1] 2

If we want to access rank using indices instead, and noting that qr is the 7th component in list M and that rank is the 5th element in list qr we type:

M[[7]][[5]]
[1] 2

This should illustrate the value in assigning tags to list elements; not only do the double brackets clutter the expression, but finding the element numbers can be daunting in a complex list structure.
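Tag names can also be used inside double brackets, and the position of a tag can be looked up rather than counted by hand. The sketch below rebuilds the model so that it runs on its own; by.name, by.index and pos are illustrative names.

```r
A <- data.frame(
     x = c(7.3, 29.4, 29.4, 2.9, 12.3, 7.5, 36.0, 4.8, 18.8, 4.2),
     y = c(5.2, 26.6, 31.2, 2.2, 13.8, 7.8, 35.2, 8.6, 20.3, 1.1)
     )
M <- lm(y ~ x, A)

by.name  <- M[["qr"]][["rank"]]   # tag names inside double brackets
by.index <- M[[7]][[5]]           # the positional equivalent
identical(by.name, by.index)      # TRUE

# Look up a component's position from its tag
pos <- which(names(M) == "qr")
pos                               # 7

# Many model components also have accessor functions
all.equal(coef(M), M$coefficients)   # TRUE
```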

Coercing data from one type to another

Data can be coerced from one type to another. For example, to coerce the following vector object from character to numeric, use the as.numeric() function.

y   <- c("23.4", "6", "100.01","6")
y.c <- as.numeric(y)
y.c
[1]  23.40   6.00 100.01   6.00

The as.numeric function forces the vector to a double (you could have also used the as.double function).

To convert a number to a character use as.character().

numchar <- as.character(y.c)
numchar
[1] "23.4"   "6"      "100.01" "6"     

You can also coerce a number or character to a factor.

numfac <- as.factor(y)
numfac
[1] 23.4   6      100.01 6     
Levels: 100.01 23.4 6
charfac <- as.factor(y.c)
charfac
[1] 23.4   6      100.01 6     
Levels: 6 23.4 100.01

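In versions of R earlier than 4.0.0, it’s not uncommon for R’s file reading functions to create factors from character variables (columns); since R 4.0.0, these functions keep character columns as characters by default. If you need to coerce a factor back to a character use the as.character() function.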

char <- as.character(charfac)
char
[1] "23.4"   "6"      "100.01" "6"     

Numbers stored as factors can also be coerced back to numbers, but be careful, the following will not produce the expected output:

num <- as.numeric(numfac)
num
[1] 2 3 1 3

The output does not look like a numeric representation of the original elements in numfac. This is because factors store values as integers where each integer is a pointer to one of the unique factor levels. Here, we have three unique factor levels, hence three unique integer levels. Level number 1 represents 100.01, level number 2 represents 23.4, and so on. To extract the numbers, you must first convert the factors to characters before converting to numbers.

num <- as.numeric(as.character(numfac))
num
[1]  23.40   6.00 100.01   6.00
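An alternative, suggested in R’s help page for factor, is to convert the levels once and then index them by the factor. A sketch (the name num2 is illustrative):

```r
y      <- c("23.4", "6", "100.01", "6")
numfac <- as.factor(y)

levels(numfac)               # "100.01" "23.4" "6"
as.integer(numfac)           # 2 3 1 3: the pointers to the levels

# Convert the three levels to numbers, then pick them out using the factor
num2 <- as.numeric(levels(numfac))[numfac]
num2                         # 23.40 6.00 100.01 6.00
```

This converts each unique level only once, which can matter for long factors with few levels.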

A summary of coercing functions follows:

Function                       Description
as.character()                 Convert to character
as.numeric() or as.double()    Convert to double
as.integer()                   Convert to integer
as.factor()                    Convert to factor
as.logical()                   Convert to a Boolean
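Coercion does not always succeed. The sketch below shows a few edge cases worth knowing; values that cannot be converted become NA (with a warning in the numeric case, silenced here with suppressWarnings()).

```r
as.integer(3.9)        # 3: the fractional part is dropped
as.integer("12")       # 12

# Strings that do not look like numbers become NA
bad <- suppressWarnings(as.numeric(c("4.5", "abc")))
bad                    # 4.5 NA

# Only a few spellings are recognized as logical values
as.logical(c("TRUE", "false", "T", "yes"))   # TRUE FALSE TRUE NA

# For numbers, 0 is FALSE and any other value is TRUE
as.logical(c(0, 2, -1))                      # FALSE TRUE TRUE

as.character(TRUE)     # "TRUE"
```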