stringr::str_detect()

Function of the Week: stringr::str_detect()

str_detect()

In this document, I will introduce the str_detect() function and show what it’s for.

# load the tidyverse
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.2     ✓ purrr   0.3.4
## ✓ tibble  3.0.4     ✓ dplyr   1.0.2
## ✓ tidyr   1.1.2     ✓ stringr 1.4.0
## ✓ readr   1.4.0     ✓ forcats 0.5.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
#example dataset
library(palmerpenguins)
data(penguins)
data(package = 'palmerpenguins')

What is it for?

str_detect() is part of the stringr package, which is loaded with the tidyverse.

When we look at the documentation for str_detect() in the ‘R Help’ tab, we see that this function allows us to “detect the presence or absence of a pattern in a string.” In other words, we can check whether a particular piece of text, or a pattern of text, is present in our data. We can even check whether strings start with a particular character, contain repeats of a character, or contain any of a set of characters. str_detect() returns a logical vector (TRUE or FALSE for each string), so it can also be combined with other functions to filter data by the results, or to look at totals or proportions (helpful when looking at larger character vectors).
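As a rough sketch of those checks, here is a small example on a made-up character vector. The vector `fruits` and the result names are hypothetical and are not part of the penguins data; only `str_detect()`, `sum()`, and `mean()` are real functions.

```r
library(stringr)

# Hypothetical example vector (an assumption for illustration only)
fruits <- c("apple", "banana", "cherry", "kiwi")

# Starts with a particular character: ^ anchors the match to the start
starts_with_a <- str_detect(fruits, "^a")
# TRUE FALSE FALSE FALSE

# Contains a repeated character: p{2} means two p's in a row
has_double_p <- str_detect(fruits, "p{2}")
# TRUE FALSE FALSE FALSE

# Contains any one of a set of characters: [ky] means k or y
has_k_or_y <- str_detect(fruits, "[ky]")
# FALSE FALSE TRUE TRUE

# TRUE counts as 1 and FALSE as 0, so sum() gives a total
# and mean() gives a proportion
n_with_an <- sum(str_detect(fruits, "an"))   # 1
prop_with_e <- mean(str_detect(fruits, "e")) # 0.5
```

Because the result is a plain logical vector, the same `sum()` and `mean()` trick should work on a data frame column as well.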

There are many, many more expressions that can be used to describe patterns within strings. This PDF cheat sheet is a great place to start: https://evoldyn.gitlab.io/evomics-2018/ref-sheets/R_strings.pdf

# "ab|d" matches "ab" or "d"; "T|B" matches island names containing T or B - Torgersen and Biscoe
str_detect(penguins$island, "T|B")
##   [1]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
##  [13]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
##  [25]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE
##  [37] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [49] FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
##  [61]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
##  [73]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
##  [85] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [97] FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [109]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [121]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [133] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [145] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE
## [157]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [169]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [181]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [193]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [205]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [217]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [229]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [241]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [253]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [265]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [277] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [289] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [301] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [313] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [325] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [337] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE

Here’s a simple example of how we can combine str_detect() with filter() to create a subset.

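# Filtering by islands starting with T or B - Torgersen and Biscoe
# The parentheses make ^ apply to both alternatives; "^T|B" would mean
# "starts with T" or "contains B anywhere"
penguins_subset <- penguins %>%
  filter(str_detect(island, "^(T|B)"))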
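One thing worth checking when mixing ^ with |: the alternation splits the whole pattern, so the anchor only belongs to the side it is written on. The sketch below compares an ungrouped and a grouped pattern on a made-up vector. The vector `places` and the value "Old Bay" are hypothetical, added only so that one string has a B that is not at the start.

```r
library(stringr)

# Hypothetical names; "Old Bay" is not an island in the penguins data
places <- c("Torgersen", "Biscoe", "Dream", "Old Bay")

# Ungrouped: "starts with T" OR "contains B anywhere"
ungrouped <- str_detect(places, "^T|B")
# TRUE TRUE FALSE TRUE

# Grouped: "starts with T or starts with B"
grouped <- str_detect(places, "^(T|B)")
# TRUE TRUE FALSE FALSE

# A character class is an equivalent way to write the grouped version
char_class <- str_detect(places, "^[TB]")
# TRUE TRUE FALSE FALSE
```

On the three penguin islands both patterns give the same answer, because none of the names has a capital B after the first letter, but the grouped form is the one that matches the stated intent.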

Is it helpful?

Overall, str_detect() can be very useful when working with text data, especially in combination with other functions in the stringr package and other R functions. In an example I read about (https://blog.exploratory.io/filter-with-text-data-952df792c2ba), the author had 6 different cities in his data set that contained the phrase ‘New’, but wanted to select only New York and Newark. Using ‘.’ (dot) and ‘*’ (asterisk), he filtered for city names that had any number of characters between ‘New’ and ‘rk’: the dot matches any single character, and the asterisk repeats the dot zero or more times (capturing any number of letters).

flight <- read_csv("data/airline_delay_2014_1.csv")
## Warning: Missing column names filled in: 'X27' [27]
## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   .default = col_character(),
##   YEAR = col_double(),
##   MONTH = col_double(),
##   DAY_OF_MONTH = col_double(),
##   FL_DATE = col_date(format = ""),
##   FL_NUM = col_double(),
##   DEST_AIRPORT_ID = col_double(),
##   DEP_DELAY = col_double(),
##   ARR_DELAY = col_double(),
##   CANCELLED = col_double(),
##   AIR_TIME = col_double(),
##   DISTANCE = col_double(),
##   CARRIER_DELAY = col_double(),
##   WEATHER_DELAY = col_double(),
##   X27 = col_logical()
## )
## ℹ Use `spec()` for the full column specifications.
flight %>%
  select(FL_DATE, CARRIER, ORIGIN, ORIGIN_CITY_NAME, ORIGIN_STATE_ABR, DEP_DELAY, DEP_TIME, ARR_DELAY, ARR_TIME) %>%
  filter(str_detect(ORIGIN_CITY_NAME, "New.*rk")) %>%
  count(ORIGIN_CITY_NAME)
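
If the flight data file is not at hand, the same pattern can be tried on a few made-up city names. This is only a sketch: the vector `cities` is hypothetical and is not taken from the airline data set, so the values in the real ORIGIN_CITY_NAME column may be formatted differently.

```r
library(stringr)

# Hypothetical city names (an assumption, not the actual flight data)
cities <- c("New York, NY", "Newark, NJ", "New Orleans, LA",
            "New Haven, CT", "Newburgh, NY")

# "New", then any characters (zero or more), then "rk"
is_match <- str_detect(cities, "New.*rk")
# TRUE TRUE FALSE FALSE FALSE

# Keep only the matching names
matched <- cities[is_match]
# "New York, NY" "Newark, NJ"
```

"New York" matches because " Yo" sits between "New" and "rk", and "Newark" matches because "a" does. The other names contain "New" but never reach an "rk" afterwards.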