Numbers and dates are not the only variable types we might be interested in exploring. We often find ourselves having to manipulate character (text) objects as well. In a programming environment, such queries are often referred to as string searches. A string query may involve assessing whether a variable matches or contains an exact set of characters; it may also involve extracting a certain set of characters given some pattern. R has a very capable set of string operations built into its environment; however, many find them difficult to master. This tutorial uses a package that simplifies the task, called
stringr. A write-up of its capabilities can be found here.
This is the simplest string operation one can perform. It involves assessing if a variable is equal (or not) to a complete text string.
We’ve already seen how the conditional statements can be used to check whether a variable is equal to, less than or greater than a number. We can use conditional statements to evaluate if a variable matches an exact string. For example, the following chunk of code returns
TRUE since the strings match exactly.
a <- "Abc def"
a == "Abc def"
However, note that R is case sensitive, so the following query returns
FALSE since the first character does not match in case (i.e. upper case
A vs. lower case
a).
a == "abc def"
If you want R to ignore case in a string operation, simply force the variable to lower case and define the pattern being compared against in lower case. For example:
tolower(a) == "abc def"
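The same idea carries over to multi-element vectors. The following is a minimal sketch; the vector fruits is made up for illustration and is not part of the tutorial's data. Both sides of the comparison are forced to lower case before testing.

```r
# Hypothetical vector with inconsistent capitalization
fruits <- c("Apple", "apple", "APPLE", "Pear")

# Case-sensitive comparison: only the exact match is TRUE
fruits == "apple"

# Case-insensitive comparison: lower-case both sides first
tolower(fruits) == tolower("Apple")
```

Lower-casing the pattern as well as the variable guards against a pattern that was itself typed with capital letters.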
To check if object
a has the pattern
"c d" (note the space in between the letters) anywhere in its string, use the
str_detect function as follows:
library(stringr)
str_detect(a, "c d")
The following example compares the string to
"cd" (note the omission of the space):
str_detect(a, "cd")
To check if object
a starts with the pattern
"c d", add the caret character
^ in front of the pattern as in:
str_detect(a, "^c d")
To check if object
a ends with the pattern
"Abc", add the dollar character
$ to the end of the pattern as in:
str_detect(a, "Abc$")
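Because str_detect returns a logical value for each element, it can also be used to subset a vector. Here is a small sketch of the two anchors applied to a made-up vector x (an assumption for illustration only):

```r
library(stringr)

# Hypothetical vector used only for illustration
x <- c("Abc def", "def Abc", "abc")

str_detect(x, "^Abc")   # TRUE where the element starts with "Abc"
str_detect(x, "Abc$")   # TRUE where the element ends with "Abc"

# The logical output can be used to keep only the matching elements
x[str_detect(x, "^Abc")]
```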
If you want to find where a particular pattern lies in a string, use the
str_locate function. For example, to find where the pattern
"c d" occurs in object
a, type:
str_locate(a, "c d")
     start end
[1,]     3   5
The function returns two values: the position in the string where the pattern starts (here, position 3) and the position where the pattern ends (here, position 5).
Note that if the pattern is not found,
str_locate returns NA for both positions:
     start end
[1,]    NA  NA
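The NA values make it easy to test whether a pattern was found before using the positions. The following is one possible way to do this; the message text is arbitrary.

```r
library(stringr)

a <- "Abc def"

loc <- str_locate(a, "cd")   # "cd" does not occur in a

# is.na() flags a missing match
if (is.na(loc[1, "start"])) {
  message("Pattern not found")
} else {
  message("Pattern starts at position ", loc[1, "start"])
}
```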
Note too that the
str_locate function only returns the position of the first occurrence. For example, the following chunk will only return the start/end positions of the first occurrence of
Ab:
b <- "Abc def Abg"
str_locate(b, "Ab")
     start end
[1,]     1   2
To find all occurrences, use the
str_locate_all() function as in:
str_locate_all(b, "Ab")
[[1]]
     start end
[1,]     1   2
[2,]     9  10
The function returns a
list object. To extract the position values into a data frame, simply wrap the function in a call to
as.data.frame, for example:
str.pos <- as.data.frame(str_locate_all(b, "Ab"))
str.pos
  start end
1     1   2
2     9  10
The reason str_locate_all returns a list and not a matrix or a data frame can be understood in the following example:
# Create a 5 element string vector
d <- c("Abc", "Def ", "Abc Def Ab", " bc ", "ef ")
# Search for all instances of "Ab"
str_locate_all(d, "Ab")
[[1]]
     start end
[1,]     1   2

[[2]]
     start end

[[3]]
     start end
[1,]     1   2
[2,]     9  10

[[4]]
     start end

[[5]]
     start end
d is a five element string vector (so far we’ve worked with single element vectors). The
str_locate_all function returns a result for each element of that vector, and since a pattern can be found multiple times in the same vector element, the output can only be conveniently stored in a list.
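If a single table of matches is preferred over a list, one possible approach (a sketch, not a stringr feature) is to tag each match with the index of the vector element it came from and stack the results. The names loc.list and loc.df are made up for this example.

```r
library(stringr)

d <- c("Abc", "Def ", "Abc Def Ab", " bc ", "ef ")
loc.list <- str_locate_all(d, "Ab")

# Build one small data frame per vector element, then stack them.
# Elements with no match contribute zero rows.
loc.df <- do.call(rbind, lapply(seq_along(loc.list), function(i) {
  m <- loc.list[[i]]
  data.frame(element = rep(i, nrow(m)),
             start   = m[, "start"],
             end     = m[, "end"])
}))
loc.df
```

The resulting table has one row per match: one row for the first element of d and two rows for the third.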
A natural extension to finding the positions of patterns in a text is to find the string’s total length. This can be accomplished with the
str_length() function.
For a multi-element vector, the output looks like this:
str_length(d)
[1]  3  4 10  4  3
To find out how often the pattern
Ab occurs in each element of object
d, use the
str_count() function:
str_count(d, "Ab")
[1] 1 0 2 0 0
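Per-element counts of this kind can be obtained with stringr's str_count() function. The sketch below shows two things one might do with the counts; the object names n.ab and has.ab are assumptions made for the example.

```r
library(stringr)

d <- c("Abc", "Def ", "Abc Def Ab", " bc ", "ef ")

n.ab <- str_count(d, "Ab")

# Elements with at least one match; equivalent to str_detect(d, "Ab")
has.ab <- n.ab > 0
d[has.ab]

# Total number of matches across the whole vector
sum(n.ab)
```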
The str_pad() function can be used to pad numbers with leading zeros. Note that in doing so, you are creating a character object from a numeric object.
e <- c(12, 2, 503, 20, 0)
str_pad(e, width = 3, side = "left", pad = "0")
[1] "012" "002" "503" "020" "000"
You can append strings with custom text using the
str_c() function. For example, to add the string length at the end of each vector element in
d, type:
str_c(d, " has ", str_length(d), " characters")
[1] "Abc has 3 characters" "Def  has 4 characters" "Abc Def Ab has 10 characters"
[4] " bc  has 4 characters" "ef  has 3 characters"
You can remove leading or trailing (or both) white spaces from a string. For example, to remove leading white spaces from object
d, type:
d.left.trim <- str_trim(d, side = "left")
Now let’s compare the string lengths of the original to those of the left-trimmed version:
str_length(d)
[1]  3  4 10  4  3
str_length(d.left.trim)
[1]  3  4 10  3  3
To remove trailing spaces set
side = "right" and to remove both leading and trailing spaces set
side = "both".
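The effect of each side option can be checked by comparing string lengths. This is a small sketch on a made-up string x with two leading and two trailing spaces.

```r
library(stringr)

x <- "  bc  "   # two leading and two trailing spaces (6 characters)

str_length(str_trim(x, side = "left"))    # 4: leading spaces removed
str_length(str_trim(x, side = "right"))   # 4: trailing spaces removed
str_length(str_trim(x, side = "both"))    # 2: both removed
```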
To replace all instances of a specified set of characters in a string with another set of characters, use the
str_replace_all() function. For example, to replace all spaces in object
b with dashes, type:
str_replace_all(b, " ", "-")
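For comparison, stringr also provides str_replace(), which replaces only the first match in each string. The sketch below contrasts the two on object b.

```r
library(stringr)

b <- "Abc def Abg"

# Replace only the first space
str_replace(b, " ", "-")       # "Abc-def Abg"

# Replace every space
str_replace_all(b, " ", "-")   # "Abc-def-Abg"
```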
To extract the characters found between two given positions in a string, use the
str_sub() function. For example, to extract the characters between positions two and five (inclusive), type:
str_sub(b, start=2, end=5)
[1] "bc d"
If you don’t specify a
start position, then all characters up to and including the
end position will be returned. Likewise, if the
end position is not specified then all characters from the
start position to the end of the string will be returned.
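A short sketch of both cases, using object b from the earlier examples:

```r
library(stringr)

b <- "Abc def Abg"

# No start position: everything up to and including position 5
str_sub(b, end = 5)     # "Abc d"

# No end position: everything from position 5 to the end of the string
str_sub(b, start = 5)   # "def Abg"
```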
If you want to break a string up into individual components based on a character delimiter, use the
str_split() function. For example, to split the following string into separate elements by comma, type the following:
g <- "Year:2000, Month:Jan, Day:23"
str_split(g, ",")
[[1]]
[1] "Year:2000"  " Month:Jan" " Day:23"
The output is a one element list. If object
g consists of more than one element, the output will be a list of as many elements as there are elements in g.
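To illustrate the multi-element case, here is a sketch using a made-up two-element vector g2 (not part of the tutorial's data) whose elements split into a different number of pieces.

```r
library(stringr)

# Hypothetical two-element vector
g2 <- c("Year:2000, Month:Jan, Day:23", "Year:2001, Month:Feb")

out <- str_split(g2, ",")

length(out)    # 2: one list element per element of g2
lengths(out)   # 3 2: number of pieces found in each element
out[[2]]       # the pieces from the second element
```

Because each element can yield a different number of pieces, a list is the natural container for the result.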
Depending on your workflow, you may need to convert the
str_split output to an atomic vector. For example, if you want to find an element in the above
str_split output that matches the string
Year:2000, the following will return
FALSE and not
TRUE as expected:
"Year:2000" %in% str_split(g, ",")
The workaround is to convert the right-hand output to a single vector using the
unlist function:
"Year:2000" %in% unlist(str_split(g, ","))
If you are applying the split function to a column of data from a dataframe, you will want to use the function
str_split_fixed instead. This function assumes that the number of components to be extracted via the split will be the same for each vector element. For example, the following vector,
T1, has two time components that need to be extracted. The separator is a dash,
-.
T1 <- c("9:30am-10:45am", "9:00am- 9:50am", "1:00pm- 2:15pm")
T1
[1] "9:30am-10:45am" "9:00am- 9:50am" "1:00pm- 2:15pm"
str_split_fixed(T1, "-", 2)
     [,1]     [,2]     
[1,] "9:30am" "10:45am"
[2,] "9:00am" " 9:50am"
[3,] "1:00pm" " 2:15pm"
The third parameter in the
str_split_fixed function is the number of elements to return which also defines the output dimension (here, a three row and two column table). If you want to extract both times to separate vectors, reference the columns by index number:
T1.start <- str_split_fixed(T1, "-", 2)[ , 1]
T1.start
[1] "9:30am" "9:00am" "1:00pm"
T1.end <- str_split_fixed(T1, "-", 2)[ , 2]
T1.end
[1] "10:45am" " 9:50am" " 2:15pm"
You will want to use the indexes if you are extracting strings in a data frame. For example:
dat <- data.frame(Time = c("9:30am-10:45am", "9:00am-9:50am", "1:00pm-2:15pm"))
dat$Start_time <- str_split_fixed(dat$Time, "-", 2)[ , 1]
dat$End_time <- str_split_fixed(dat$Time, "-", 2)[ , 2]
dat
            Time Start_time End_time
1 9:30am-10:45am     9:30am  10:45am
2  9:00am-9:50am     9:00am   9:50am
3  1:00pm-2:15pm     1:00pm   2:15pm
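A possible variation, offered as a sketch rather than the tutorial's method: split once, store the resulting matrix (here named parts, an arbitrary name), and trim each column so that any padding space around the times is dropped. The data below reuse the padded time strings from the T1 example.

```r
library(stringr)

dat <- data.frame(Time = c("9:30am-10:45am", "9:00am- 9:50am", "1:00pm- 2:15pm"))

# Split once and reuse the resulting two-column matrix
parts <- str_split_fixed(dat$Time, "-", 2)

# str_trim() removes the padding space found in some end times
dat$Start_time <- str_trim(parts[, 1])
dat$End_time   <- str_trim(parts[, 2])
dat
```

Storing the split avoids running str_split_fixed twice on the same column.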
To extract the three letter months from object
g (defined in the last example), you can use a combination of
stringr functions as in:
loc <- str_locate(g, "Month:")
str_sub(g, start = loc[,"end"] + 1, end = loc[,"end"] + 3)
The above chunk of code first identifies the position of the
Month: string and passes its output to the object
loc (a matrix). It then uses the
end position in the call to
str_sub to extract the three characters making up the month abbreviation. The value
1 is added to the
start parameter in
str_sub to omit the last character of
Month: (recall that the
str_locate positions are inclusive).
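If the same locate-then-extract steps are needed for several labels, they could be wrapped in a small function. The helper extract_after below is hypothetical (it is not a stringr function), and it uses stringr's fixed() so that the label is matched literally rather than as a regular expression.

```r
library(stringr)

g <- "Year:2000, Month:Jan, Day:23"

# Hypothetical helper: return the n characters that follow a label
extract_after <- function(x, label, n) {
  loc <- str_locate(x, fixed(label))
  str_sub(x, start = loc[, "end"] + 1, end = loc[, "end"] + n)
}

extract_after(g, "Month:", 3)   # "Jan"
extract_after(g, "Year:", 4)    # "2000"
```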
This can be extended to multi-element vectors as follows:
# Note the differences in spaces and string lengths between the vector
# elements.
gs <- c("Year:2000, Month:Jan, Day:23",
        "Year:345, Month:Mar, Day:30",
        "Year:1867 , Month:Nov, Day:5")
loc <- str_locate(gs, "Month:")
str_sub(gs, start = loc[,"end"] + 1, end = loc[,"end"] + 3)
[1] "Jan" "Mar" "Nov"
Note the non-uniformity in each element’s length and
Month: position which requires that we explicitly search for the
Month: string position in each element. Had all elements been of equal length and format, we could have simply assigned the position numbers in the call to