stringr |
---|
1.5.1 |
6 Working with string objects
Numbers and dates are not the only variable types we might be interested in exploring or manipulating. Character (text) objects can be part of a data query operation as well. In a programming environment, such queries are often referred to as string searches. A string query may involve assessing whether a variable matches or contains an exact set of characters; it can also involve extracting a set of characters that follows some pattern. R has a very capable set of string operations built into its environment; however, many find them difficult to master. This tutorial uses the stringr package, which simplifies the task.
6.1 Finding patterns in a string
6.1.1 Checking for an exact string match
This is the simplest string operation one can perform. It involves assessing if a variable is equal (or not) to a complete text string.
We’ve already seen how conditional statements can be used to check whether a variable is equal to, less than or greater than a number. The same statements can evaluate whether a variable matches an exact string. For example, the following chunk of code returns TRUE since the strings match exactly.
a <- "Abc def"
a == "Abc def"
[1] TRUE
However, note that R is case sensitive, so the following query returns FALSE since the first character does not match in case (i.e. upper case A vs. lower case a).
a == "abc def"
[1] FALSE
6.1.1.1 How to ignore case sensitivity?
If you want R to ignore case in a string comparison, convert the variable to lower case and write the pattern being compared against in lower case as well. For example:
tolower(a) == "abc def"
[1] TRUE
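As a complementary sketch (not part of the original example), stringr offers its own lower-casing function, str_to_lower(), and its regex() helper accepts an ignore_case argument for partial matches. The object names exact_match and partial_match are made up for illustration.

```r
library(stringr)

a <- "Abc def"

# Same idea as tolower(), using stringr's equivalent function
exact_match <- str_to_lower(a) == "abc def"

# Case-insensitive partial match: wrap the pattern in regex()
partial_match <- str_detect(a, regex("ABC", ignore_case = TRUE))

exact_match    # TRUE
partial_match  # TRUE
```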
6.1.2 Checking for a partial match
6.1.2.1 Matching anywhere in the string
To check if object a has the pattern "c d" (note the space in between the letters) anywhere in its string, use stringr’s str_detect function as follows:
library(stringr)
str_detect(a, "c d")
[1] TRUE
The following example compares the string to "cd" (note the omission of the space):
str_detect(a, "cd")
[1] FALSE
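str_detect is vectorized, so the same call works on a multi-element vector and its output can be used to subset that vector. The following is a small sketch; the vector x is a hypothetical example and not part of the tutorial's data.

```r
library(stringr)

# Hypothetical three-element vector used only for this illustration
x <- c("Abc def", "Abcdef", "abc d")

# One logical value is returned per element
hits <- str_detect(x, "c d")
hits      # TRUE FALSE TRUE

# Keep only the elements that contain the pattern
x[hits]   # "Abc def" "abc d"
```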
6.1.2.2 Matching the beginning of the string
To check if object a starts with the pattern "c d", add the caret character ^ in front of the pattern as in:
str_detect(a, "^c d")
[1] FALSE
6.1.2.3 Matching the end of the string
To check if object a ends with the pattern "Abc", add the dollar character $ to the end of the pattern as in:
str_detect(a, "Abc$")
[1] FALSE
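Both anchor examples above return FALSE. For contrast, here is a short sketch of anchored patterns that do match, along with a pattern that uses both anchors to test the whole string. The object names are chosen for illustration only.

```r
library(stringr)

a <- "Abc def"

starts_abc <- str_detect(a, "^Abc")       # TRUE: the string starts with "Abc"
ends_def   <- str_detect(a, "def$")       # TRUE: the string ends with "def"

# Using both anchors requires the pattern to span the entire string
whole_ok   <- str_detect(a, "^Abc def$")  # TRUE
whole_bad  <- str_detect(a, "^Abc$")      # FALSE: "Abc" is only part of the string
```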
6.1.3 Matching against a list of characters
If you want to match against a set of strings, you may want to make use of the matching operator, %in%. This operator is built into base R and is thus not part of the stringr package. For example, to check whether each element of object b equals either "Jun" or "Sep", type:
b <- c("Jan", "Feb", "Jun", "Oct")
b %in% c("Jun", "Sep")
[1] FALSE FALSE TRUE FALSE
The output is a boolean value for each element in b.
The %in% operator is described in greater detail in the next chapter.
6.1.4 Locating the position of a pattern in a string
If you want to find where a particular pattern lies in a string, use the str_locate function. For example, to find where the pattern "c d" occurs in object a, type:
str_locate(a, "c d")
start end
[1,] 3 5
The function returns two values: the position in the string where the pattern starts (here, position 3) and the position where the pattern ends (here, position 5).
Note that if the pattern is not found, str_locate returns NAs:
str_locate(a, "cd")
start end
[1,] NA NA
Note too that the str_locate function only returns the position of the first occurrence. For example, the following chunk will only return the start/end positions of the first occurrence of Ab.
b <- "Abc def Abg"
str_locate(b, "Ab")
start end
[1,] 1 2
To find all occurrences, use the str_locate_all() function as in:
str_locate_all(b,"Ab")
[[1]]
start end
[1,] 1 2
[2,] 9 10
The function returns a list object. To extract the position values into a data frame, simply wrap the function in a call to as.data.frame, for example:
str.pos <- as.data.frame(str_locate_all(b,"Ab"))
str.pos
start end
1 1 2
2 9 10
The reason str_locate_all returns a list and not a matrix or a data frame can be understood in the following example:
# Create a 5 element string vector
d <- c("Abc", "Def ", "Abc Def Ab", " bc ", "ef ")

# Search for all instances of "Ab"
str_locate_all(d,"Ab")
[[1]]
start end
[1,] 1 2
[[2]]
start end
[[3]]
start end
[1,] 1 2
[2,] 9 10
[[4]]
start end
[[5]]
start end
Here, d is a five element string vector (so far we’ve worked with single element vectors). The str_locate_all function returns a result for each element of that vector, and since a pattern can be found multiple times in the same vector element, the output can only be conveniently stored in a list.
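One way to work with this list, sketched below, is to pull out a single element's matrix with double brackets, or to count the rows of each matrix to get the number of matches per element. The object names locs, third and n_matches are illustrative.

```r
library(stringr)

d <- c("Abc", "Def ", "Abc Def Ab", " bc ", "ef ")
locs <- str_locate_all(d, "Ab")

# Positions found in the third element only (a two-row matrix)
third <- locs[[3]]
third

# Number of matches per element: one row per match, zero rows if none
n_matches <- vapply(locs, nrow, integer(1))
n_matches   # 1 0 2 0 0
```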
6.1.5 Finding the length of a string
A natural extension to finding the positions of patterns in a text is to find the string’s total length. This can be accomplished with the str_length() function:
str_length(b)
[1] 11
For a multi-element vector, the output looks like this:
str_length(d)
[1] 3 4 10 4 3
6.1.6 Finding a pattern’s frequency
To find out how often the pattern Ab occurs in each element of object d, use the str_count() function.
str_count(d, "Ab")
[1] 1 0 2 0 0
6.2 Searching for non alphanumeric strings
The following characters need specialized syntax when sought in a regular expression: . + * ? ^ $ ( ) [ ] { } | and \ (the backslash). For example, when searching for an opening parenthesis ( in a string, the following code will generate an error:
str_detect("Some text (with parenthesis)", "(with")
Error in stri_detect_regex(string, pattern, negate = negate, opts_regex = opts(pattern)): Incorrectly nested parentheses in regex pattern. (U_REGEX_MISMATCHED_PAREN, context=`(with`)
To resolve this, you need to precede the special character with the escape characters \\ (two backslashes) as in:
str_detect("Some text (with parenthesis)", "\\(with")
[1] TRUE
Likewise, when searching for an asterisk, *, type:
str_detect("x * y", "\\*")
[1] TRUE
Note that not all special characters will generate an error. For example, if you are looking for a period . in a string, the following will not generate the desired outcome:
str_detect("x * y", ".")
[1] TRUE
One might expect FALSE here, given that no period is present in the string. However, the dot has a special role in a regular expression: it matches any single character. The call therefore returns TRUE because the string x * y contains at least one character.
To generate the desired outcome, add \\ in front of the period:
str_detect("x * y", "\\.")
[1] FALSE
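An alternative to escaping, sketched here as an option rather than as the tutorial's recommended approach, is to wrap the pattern in stringr's fixed() function, which treats the pattern as literal text instead of a regular expression.

```r
library(stringr)

# fixed() matches the pattern literally, so no escaping is needed
paren_found <- str_detect("Some text (with parenthesis)", fixed("(with"))
dot_found   <- str_detect("x * y", fixed("."))
star_found  <- str_detect("x * y", fixed("*"))

paren_found  # TRUE
dot_found    # FALSE: there is no literal period in the string
star_found   # TRUE
```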
To learn more about regular expressions, see the Wikipedia entry on the topic.
6.3 Modifying strings
6.3.1 Padding numbers with leading zeros
The str_pad() function can be used to pad numbers with leading zeros. Note that in doing so, you are creating a character object from a numeric object.
e <- c(12, 2, 503, 20, 0)
str_pad(e, width=3, side="left", pad = "0" )
[1] "012" "002" "503" "020" "000"
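The sketch below checks the change of class mentioned above and shows that a value already wider than the requested width is returned unchanged rather than truncated. The base R sprintf() call is included only as a point of comparison; the object names are illustrative.

```r
library(stringr)

e <- c(12, 2, 503, 20, 0)
padded <- str_pad(e, width = 3, side = "left", pad = "0")

class(e)       # "numeric"
class(padded)  # "character"

# A value wider than `width` is not truncated
long_padded <- str_pad("1234", width = 3, pad = "0")
long_padded    # "1234"

# Base R comparison: format whole numbers with leading zeros
base_padded <- sprintf("%03d", as.integer(e))
identical(padded, base_padded)  # TRUE
```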
6.3.2 Appending text to strings
You can append custom text to strings using the str_c() function. For example, to add the string length at the end of each vector element in d, type:
str_c(d, " has ", str_length(d), " characters" )
[1] "Abc has 3 characters" "Def  has 4 characters" "Abc Def Ab has 10 characters"
[4] " bc  has 4 characters" "ef  has 3 characters"
6.3.3 Removing white spaces
You can remove leading or trailing (or both) white spaces from a string. For example, to remove leading white spaces from object d, type:
d.left.trim <- str_trim(d, side="left")
Now let’s compare the original to the left-trimmed version:
str_length(d)
[1] 3 4 10 4 3
str_length(d.left.trim)
[1] 3 4 10 3 3
To remove trailing spaces set side = "right", and to remove both leading and trailing spaces set side = "both".
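As a short sketch of the side = "both" option, the following trims both ends of each element in d. The last line uses stringr's str_squish() function, which also collapses repeated spaces inside a string; it is shown as an extra and is applied to a made-up string.

```r
library(stringr)

d <- c("Abc", "Def ", "Abc Def Ab", " bc ", "ef ")

# Remove leading and trailing spaces in one call
d.both.trim <- str_trim(d, side = "both")
str_length(d.both.trim)   # 3 3 10 2 2

# str_squish() trims both ends and reduces inner runs of spaces to one
squished <- str_squish("  Abc   Def  ")
squished                  # "Abc Def"
```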
6.3.4 Replacing elements of a string
To replace all instances of a specified set of characters in a string with another set of characters, use the str_replace_all() function. For example, to replace all spaces in object b with dashes, type:
str_replace_all(b, " ", "-")
[1] "Abc-def-Abg"
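For comparison, stringr also has a str_replace() function that replaces only the first match in each string. A minimal sketch, with illustrative object names:

```r
library(stringr)

b <- "Abc def Abg"

# Only the first space is replaced
first_only <- str_replace(b, " ", "-")
first_only   # "Abc-def Abg"

# Every space is replaced
all_spaces <- str_replace_all(b, " ", "-")
all_spaces   # "Abc-def-Abg"
```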
6.4 Extracting parts of a string
6.4.1 Extracting elements of a string given start and end positions
To extract the characters that fall between two positions in a string, use the str_sub() function. For example, to extract the characters between positions two and five (inclusive) of object b, type:
str_sub(b, start=2, end=5)
[1] "bc d"
If you don’t specify a start position, then all characters up to and including the end position will be returned. Likewise, if the end position is not specified then all characters from the start position to the end of the string will be returned.
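The two cases just described can be sketched as follows. The last call uses a negative start value, which str_sub() counts from the end of the string; this goes slightly beyond what the text above covers.

```r
library(stringr)

b <- "Abc def Abg"

# No start given: everything up to and including position 3
head_part <- str_sub(b, end = 3)
head_part    # "Abc"

# No end given: everything from position 9 onward
tail_part <- str_sub(b, start = 9)
tail_part    # "Abg"

# Negative positions count backward from the end of the string
last_three <- str_sub(b, start = -3)
last_three   # "Abg"
```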
6.4.2 Splitting a string by a character
If you want to break a string up into individual components based on a character delimiter, use the str_split() function. For example, to split the following string into separate elements by comma, type the following:
g <- "Year:2000, Month:Jan, Day:23"
str_split(g, ",")
[[1]]
[1] "Year:2000" " Month:Jan" " Day:23"
The output is a one-component list. If object g consists of more than one element, the output will be a list with as many components as there are elements in g.
Depending on your workflow, you may need to convert the str_split output to an atomic vector. For example, if you want to check whether the above str_split output contains an element matching the string Year:2000, the following will return FALSE and not TRUE as one might expect:
"Year:2000" %in% str_split(g, ",")
[1] FALSE
The workaround is to convert the right-hand output to a single vector using the unlist function:
"Year:2000" %in% unlist(str_split(g, ","))
[1] TRUE
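Note that the second and third pieces of the split keep the space that followed each comma, so an exact comparison against Month:Jan fails. The sketch below shows two possible ways around this: trimming the pieces after the split, or splitting on a comma followed by optional white space. The object names are illustrative.

```r
library(stringr)

g <- "Year:2000, Month:Jan, Day:23"
parts <- unlist(str_split(g, ","))

# The leading space in " Month:Jan" prevents an exact match
"Month:Jan" %in% parts          # FALSE

# Option 1: trim each piece after splitting
parts_trimmed <- str_trim(parts)
"Month:Jan" %in% parts_trimmed  # TRUE

# Option 2: split on a comma followed by any amount of white space
parts_regex <- unlist(str_split(g, ",\\s*"))
parts_regex                     # "Year:2000" "Month:Jan" "Day:23"
```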
If you are applying the split function to a column of data from a data frame, you will want to use the function str_split_fixed instead. This function assumes that the number of components to be extracted via the split will be the same for each vector element. For example, each element of the following vector, T1, has two time components that need to be extracted. The separator is a dash, -.
T1 <- c("9:30am-10:45am", "9:00am- 9:50am", "1:00pm- 2:15pm")
T1
[1] "9:30am-10:45am" "9:00am- 9:50am" "1:00pm- 2:15pm"
str_split_fixed(T1, "-", 2)
[,1] [,2]
[1,] "9:30am" "10:45am"
[2,] "9:00am" " 9:50am"
[3,] "1:00pm" " 2:15pm"
The third parameter in the str_split_fixed function is the number of pieces to return, which also defines the number of output columns (here, a three-row, two-column matrix). If you want to extract both times to separate vectors, reference the columns by index number:
T1.start <- str_split_fixed(T1, "-", 2)[ ,1]
T1.start
[1] "9:30am" "9:00am" "1:00pm"
T1.end <- str_split_fixed(T1, "-", 2)[ ,2]
T1.end
[1] "10:45am" " 9:50am" " 2:15pm"
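Two of the end times above carry a leading space inherited from the original strings. If that space is unwanted, one option is to wrap the extraction in str_trim(), as in this sketch (the object name T1.end.trim is illustrative):

```r
library(stringr)

T1 <- c("9:30am-10:45am", "9:00am- 9:50am", "1:00pm- 2:15pm")

# Extract the second column, then strip leading and trailing spaces
T1.end.trim <- str_trim(str_split_fixed(T1, "-", 2)[, 2])
T1.end.trim   # "10:45am" "9:50am" "2:15pm"
```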
You will want to use the indexes if you are extracting strings in a data frame. For example:
dat <- data.frame( Time = c("9:30am-10:45am", "9:00am-9:50am", "1:00pm-2:15pm"))
dat$Start_time <- str_split_fixed(dat$Time, "-", 2)[ , 1]
dat$End_time <- str_split_fixed(dat$Time, "-", 2)[ , 2]
dat
Time Start_time End_time
1 9:30am-10:45am 9:30am 10:45am
2 9:00am-9:50am 9:00am 9:50am
3 1:00pm-2:15pm 1:00pm 2:15pm
6.4.3 Extracting parts of a string that follow a pattern
To extract the three letter month from object g (defined in an earlier example), you can use a combination of stringr functions as in:
loc <- str_locate(g, "Month:")
str_sub(g, start = loc[,"end"] + 1, end = loc[,"end"]+3)
[1] "Jan"
The above chunk of code first identifies the position of the Month: string and passes its output to the object loc (a matrix). It then uses loc’s end position in the call to str_sub to extract the three characters making up the month abbreviation. The value 1 is added to the end position to define the start parameter in str_sub, which skips the last character of Month: (recall that the str_locate positions are inclusive).
This can be extended to multi-element vectors as follows:
# Note the differences in spaces and string lengths between the vector
# elements.
gs <- c("Year:2000, Month:Jan, Day:23",
        "Year:345, Month:Mar, Day:30",
        "Year:1867 , Month:Nov, Day:5")
loc <- str_locate(gs, "Month:")
str_sub(gs, start = loc[,"end"] + 1, end = loc[,"end"]+3)
[1] "Jan" "Mar" "Nov"
Note the non-uniformity in each element’s length and in the position of Month:, which requires that we explicitly search for the Month: string position in each element. Had all elements been of equal length and format, we could have simply passed fixed position numbers to the str_sub function.
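The same result can also be reached in a single step with a regular expression. The sketch below is an alternative to the str_locate and str_sub combination, not a replacement for it: str_match() returns the text captured by the parenthesized group, and str_extract() uses a lookbehind to return only the three letters that follow Month:. The object names month_a and month_b are illustrative.

```r
library(stringr)

gs <- c("Year:2000, Month:Jan, Day:23",
        "Year:345, Month:Mar, Day:30",
        "Year:1867 , Month:Nov, Day:5")

# Column 1 holds the full match; column 2 holds the first capture group
month_a <- str_match(gs, "Month:([A-Za-z]{3})")[, 2]
month_a   # "Jan" "Mar" "Nov"

# Lookbehind: match three letters only if preceded by "Month:"
month_b <- str_extract(gs, "(?<=Month:)[A-Za-z]{3}")
month_b   # "Jan" "Mar" "Nov"
```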