Extract info inside all parenthesis in R

Using the stringr package we can reduce this a little bit.

library(stringr)
# Get the parenthesis and what is inside
k <- str_extract_all(j, "\\([^()]+\\)")[[1]]
# Remove parenthesis
k <- substring(k, 2, nchar(k)-1)

@kohske uses regmatches but I'm currently using 2.13 so don't have access to that function at the moment. This adds the dependency on stringr but I think it is a little easier to work with and the code is a little clearer (well... as clear as using regular expressions can be...)

Edit: We could also try something like this -

re <- "\\(([^()]+)\\)"
gsub(re, "\\1", str_extract_all(j, re)[[1]])

This one works by defining a marked subexpression inside the regular expression. It extracts everything that matches the regex and then gsub extracts only the portion inside the subexpression.


Using rex may make this type of task a little simpler.

matches <- re_matches(j,
  rex(
    "(",
    capture(name = "text", except_any_of(")")),
    ")"),
  global = TRUE)

matches[[1]]$text
#>[1] "wonder" "groan"  "Laugh"

I think there are basically three easy ways of extracting multiple capture groups in R (without using substitution); str_match_all, str_extract_all, and regmatches/gregexpr combo.

I like @kohske's regex, which looks behind for an open parenthesis ?<=\\(, looks ahead for a closing parenthesis ?=\\), and grabs everything in the middle (lazily) .+?, in other words (?<=\\().+?(?=\\))

Using the same regex:

str_match_all returns the answer as a matrix.

str_match_all(j, "(?<=\\().+?(?=\\))")

     [,1]    
[1,] "wonder"
[2,] "groan" 
[3,] "Laugh" 

# Subset the matrix like this....

str_match_all(j, "(?<=\\().+?(?=\\))")[[1]][,1]
[1] "wonder" "groan"  "Laugh" 

str_extract_all returns the answer as a list.

str_extract_all(j,  "(?<=\\().+?(?=\\))")
[[1]]
[1] "wonder" "groan"  "Laugh" 

#Subset the list...
str_extract_all(j,  "(?<=\\().+?(?=\\))")[[1]]
[1] "wonder" "groan"  "Laugh" 

regmatches/gregexpr also returns the answer as a list. Since this is a base R option, some people prefer it. Note the recommended perl = TRUE.

regmatches(j, gregexpr( "(?<=\\().+?(?=\\))", j, perl = T))
[[1]]
[1] "wonder" "groan"  "Laugh" 

#Subset the list...
regmatches(j, gregexpr( "(?<=\\().+?(?=\\))", j, perl = T))[[1]]
[1] "wonder" "groan"  "Laugh" 

Hopefully, the SO community will correct/edit this answer if I've mischaracterized the most popular options.


Here is an example:

> gsub("[\\(\\)]", "", regmatches(j, gregexpr("\\(.*?\\)", j))[[1]])
[1] "wonder" "groan"  "Laugh" 

I think this should work well:

> regmatches(j, gregexpr("(?=\\().*?(?<=\\))", j, perl=T))[[1]]
[1] "(wonder)" "(groan)"  "(Laugh)" 

but the results includes parenthesis... why?

This works:

regmatches(j, gregexpr("(?<=\\().*?(?=\\))", j, perl=T))[[1]]

Thanks @MartinMorgan for the comment.

Tags:

Regex

R