6  Working with string objects

stringr
1.5.1

Numbers and dates are not the only variable types we might be interested in exploring or manipulating. Character (text) objects can be part of a data query operation as well. In the programming environment, such queries are often referred to as string searches. String queries may involve assessing if a variable matches or contains an exact set of characters; it can also involve extracting a certain set of characters given some pattern. R has a very capable set of string operations built into its environment however, many find it difficult to master. A package that will be used in this tutorial that simplifies this task is called stringr.

 

6.1 Finding patterns in a string

  

6.1.1 Checking for an exact string match

This is the simplest string operation one can perform. It involves assessing if a variable is equal (or not) to a complete text string.

We’ve already seen how the conditional statements can be used to check whether a variable is equal to, less than or greater than a number. We can use conditional statements to evaluate if a variable matches an exact string. For example, the following chunk of code returns TRUE since the strings match exactly.

a <- "Abc def"
a == "Abc def"
[1] TRUE

However, note that R differentiates cases so that the following query, returns FALSE since the first character does not match in case (i.e. upper case A vs. lower case a).

a == "abc def"
[1] FALSE

6.1.1.1 How to ignore case sensitivity?

If you want R to ignore cases in any string operations, simply force all variables to a lower case and define the pattern being compared against in lower case. For example:

tolower(a) == "abc def"
[1] TRUE

6.1.2 Checking for a partial match

6.1.2.1 Matching anywhere in the string

To check if object a has the pattern "c d" (note the space in between the letters) anywhere in its string, use stringr’s str_detect function as follows:

library(stringr)
str_detect(a, "c d")
[1] TRUE

The following example compares the string to "cd" (note the omission of the space):

str_detect(a, "cd")
[1] FALSE

6.1.2.2 Matching the begining of the string

To check if object a starts with the pattern "c d" add the carat character ^ in front of the pattern as in:

str_detect(a, "^c d")
[1] FALSE

6.1.2.3 Matching the end of the sring

To check if object a ends with the pattern "Abc" add the dollar character $ to the end of the pattern as in:

str_detect(a, "Abc$")
[1] FALSE

6.1.3 Matching against a list of characters

If you want to match against a list of characters, you may want to make use of the matching, %in%, operator. This is a built-in function and thus not part of the stringr package. For example, to check if any of the characters "Jun" or "Sep" exist in object b, type:

b <- c("Jan", "Feb", "Jun", "Oct")
b %in% c("Jun", "Sep")
[1] FALSE FALSE  TRUE FALSE

The output will generate a boolean value for each element in b.

The %in% operator is described in greater detail in the next chapter.

6.1.4 Locating the position of a pattern in a string

If you want to find where a particular pattern lies in a string, use the str_locate function. For example, to find where the pattern "c d" occurs in object a type:

str_locate(a, "c d")
     start end
[1,]     3   5

The function returns two values: the position in the string where the pattern starts (e.g. position 3) and the position where the pattern ends (e.g. position 5 )

Note that if the pattern is not found, str_locate returns NA’s:

str_locate(a, "cd")
     start end
[1,]    NA  NA

Note too that the str_locate function only returns the position of the first occurrence. For example, the following chunk will only return the start/end positions of the first occurrence of Ab.

b <- "Abc def Abg"
str_locate(b, "Ab")
     start end
[1,]     1   2

To find all occurrences, use the str_locate_all() function as in:

str_locate_all(b,"Ab")
[[1]]
     start end
[1,]     1   2
[2,]     9  10

The function returns a list object. To extract the position values into a dateframe, simply wrap the function in a call to as.data.frame, for example:

str.pos <- as.data.frame(str_locate_all(b,"Ab"))
str.pos
  start end
1     1   2
2     9  10

The reason str_locate_all returns a list and not a matrix or a data frame can be understood in the following example:

# Create a 5 element string vector
d <- c("Abc", "Def ", "Abc Def Ab", " bc ", "ef ")

# Search for all instances of "Ab"
str_locate_all(d,"Ab")
[[1]]
     start end
[1,]     1   2

[[2]]
     start end

[[3]]
     start end
[1,]     1   2
[2,]     9  10

[[4]]
     start end

[[5]]
     start end

Here, d is a five element string vector (so far we’ve worked with single element vectors). The str_locate_all function returns a result for each element of that vector, and since patterns can be found multiple times in a same vector element, the output can only be conveniently stored in a list.

6.1.5 Finding the length of a string

A natural extension to finding the positions of patterns in a text is to find the string’s total length. This can be accomplished with the str_length() function:

str_length(b)
[1] 11

For a multi-element vector, the output looks like this:

str_length(d)
[1]  3  4 10  4  3

6.1.6 Finding a pattern’s frequency

To find out how often the pattern Ab occurs in each element of object d, use the str_count() function.

str_count(d, "Ab")
[1] 1 0 2 0 0

  

6.2 Searching for non alphanumeric strings

  

The following characters need specialized syntax when sought in a regular expression: . , + , * , ? , ^ , $ , ( , ) , [ , ] , { , } , | , \ . For example, when searching for a parenthesis ( in a string, the following code will generate an error:

str_detect("Some text (with parenthesis)", "(with")
Error in stri_detect_regex(string, pattern, negate = negate, opts_regex = opts(pattern)): Incorrectly nested parentheses in regex pattern. (U_REGEX_MISMATCHED_PAREN, context=`(with`)

To resolve this, you need to precede the special character with the escape characters \\ (two backslashes) as in:

str_detect("Some text (with parenthesis)", "\\(with")
[1] TRUE

Likewise, when searching for an asterisk, *, type:

str_detect("x * y", "\\*")
[1] TRUE

Note that not all special characters will generate an error. For example, if you are looking for a period . in a string, the following will not generate the desired outcome:

str_detect("x * y", ".")
[1] TRUE

This should have outputted FALSE given that no period is present in the string. The dot has a special role in a regular expression in that it seeks out any one character in the string–hence the reason it returned TRUE given that we had at least one character in the string x * y.

To generate the desired outcome, add \\:

str_detect("x * y", "\\.")
[1] FALSE

To learn more about regular expressions, see this wikipedia entry on the topic.

 

6.3 Modifying strings

  

6.3.1 Padding numbers with leading zeros

The str_pad() function can be used to pad numbers with leading zeros. Note that in doing so, you are creating a character object from a numeric object.

e <- c(12, 2, 503, 20, 0)
str_pad(e, width=3, side="left", pad = "0" )
[1] "012" "002" "503" "020" "000"

6.3.2 Appending text to strings

You can append strings with custom text using the str_c() functions. For example, to add the string length at the end of each vector element in b type,

str_c(d, " has ", str_length(d), " characters" )
[1] "Abc has 3 characters"         "Def  has 4 characters"        "Abc Def Ab has 10 characters"
[4] " bc  has 4 characters"        "ef  has 3 characters"        

6.3.3 Removing white spaces

You can remove leading or ending (or both) white spaces from a string. For example, to remove leading white spaces from object d type,

d.left.trim <- str_trim(d, side="left")

Now let’s compare the original to the left-trimmed version:

str_length(d)
[1]  3  4 10  4  3
str_length(d.left.trim)
[1]  3  4 10  3  3

To remove trailing spaces set side = "right" and to remove both leading and trailing spaces set side = "both".

6.3.4 Replacing elements of a string

To replace all instances of a specified set of characters in a string with another set of characters, use the str_replace_all() function. For example, to replace all spaces in object b with dashes, type:

str_replace_all(b, " ", "-")
[1] "Abc-def-Abg"

 

6.4 Extracting parts of a string

 

6.4.1 Extracting elements of a string given start and end positions

To find the character elements of a vector at a given position of a given string, use the str_sub() function. For example, to find the characters between positions two and five (inclusive) type:

str_sub(b, start=2, end=5)
[1] "bc d"

If you don’t specify a start position, then all characters up to and including the end position will be returned. Likewise, if the end position is not specified then all characters from the start position to the end of the string will be returned.

6.4.2 Splitting a string by a character

If you want to break a string up into individual components based on a character delimiter, use the str_split() function. For example, to split the following string into separate elements by comma, type the following:

g <- "Year:2000, Month:Jan, Day:23"
str_split(g, ",")
[[1]]
[1] "Year:2000"  " Month:Jan" " Day:23"   

The output is a one component list. If object g consists of more than one element, the output will be a list of as many components as there are g elements.

Depending on your workflow, you may need to convert the str_split output to an atomic vector. For example, if you want to find an element in the above str_split output that matches the string Year:2000, the following will return FALSE and not TRUE as one would expect:

"Year:2000" %in% str_split(g, ",")
[1] FALSE

The workaround is to convert the right-hand output to a single vector using the unlist function:

"Year:2000" %in% unlist(str_split(g, ","))
[1] TRUE

If you are applying the split function to a column of data from a dataframe, you will want to use the function str_split_fixed instead. This function assumes that the number of components to be extracted via the split will be the same for each vector element. For example, the following vector, T1, has two time components that need to be extracted. The separator is a dash, -.

T1 <- c("9:30am-10:45am", "9:00am- 9:50am", "1:00pm- 2:15pm")
T1
[1] "9:30am-10:45am" "9:00am- 9:50am" "1:00pm- 2:15pm"
str_split_fixed(T1, "-", 2)
     [,1]     [,2]     
[1,] "9:30am" "10:45am"
[2,] "9:00am" " 9:50am"
[3,] "1:00pm" " 2:15pm"

The third parameter in the str_split_fixed function is the number of elements to return which also defines the output dimension (here, a three row and two column table). If you want to extract both times to separate vectors, reference the columns by index number:

T1.start <- str_split_fixed(T1, "-", 2)[ ,1]
T1.start
[1] "9:30am" "9:00am" "1:00pm"
T1.end   <- str_split_fixed(T1, "-", 2)[ ,2]
T1.end
[1] "10:45am" " 9:50am" " 2:15pm"

You will want to use the indexes if you are extracting strings in a data frame. For example:

dat <- data.frame( Time = c("9:30am-10:45am", "9:00am-9:50am", "1:00pm-2:15pm"))
dat$Start_time <- str_split_fixed(dat$Time, "-", 2)[ , 1]
dat$End_time   <- str_split_fixed(dat$Time, "-", 2)[ , 2]
dat
            Time Start_time End_time
1 9:30am-10:45am     9:30am  10:45am
2  9:00am-9:50am     9:00am   9:50am
3  1:00pm-2:15pm     1:00pm   2:15pm

6.4.3 Extracting parts of a string that follow a pattern

To extract the three letter months from object g (defined in the last example), you can use a combination of stringr functions as in:

loc <- str_locate(g, "Month:")
str_sub(g, start = loc[,"end"] + 1, end = loc[,"end"]+3)
[1] "Jan"

The above chunk of code first identifies the position of the Month: string and passes its output to the object loc (a matrix). It then uses the loc’s end position in the call to str_sub to extract the three characters making up the month abbreviation. The value 1 is added to the start parameter in str_sub to omit the last character of Month: (recall that the str_locate positions are inclusive).

This can be extend to multi-element vectors as follows:

# Note the differences in spaces and string lenghts between the vector
# elements.
gs <- c("Year:2000, Month:Jan, Day:23",
        "Year:345, Month:Mar, Day:30",
        "Year:1867 , Month:Nov, Day:5")

loc <- str_locate(gs, "Month:")
str_sub(gs, start = loc[,"end"] + 1, end = loc[,"end"]+3)
[1] "Jan" "Mar" "Nov"

Note the non-uniformity in each element’s length and Month: position which requires that we explicitly search for the Month: string position in each element. Had all elements been of equal length and format, we could have simply assigned the position numbers in the call to str_sub function.