String split with conditions in R

You can do this using gsubfn

library(gsubfn)
f <- function(x,y,z) if (z=="_") y else strsplit(x, ".ReCal", fixed=T)[[1]][[1]]
gsubfn("([^_]+_[^_]+)(.).*", f, mystring, backref=2)
# [1] "MODY_60.2"   "MODY_116.21" "MODY_116.3"  "MODY_116.4" 

This allows for cases when you have more than two "_", and you want to split on the second one, for example,

mystring<-c("MODY_60.2.ReCal.sort.bam",
            "MODY_116.21_C4U.ReCal.sort.bam",
            "MODY_116.3_C2RX-1-10.ReCal.sort.bam",
            "MODY_116.4.ReCal.sort.bam",
            "MODY_116.4_asdfsadf_1212_asfsdf",
            "MODY_116.5.ReCal_asdfsadf_1212_asfsdf",  # split by second "_", leaving ".ReCal"
            "MODY")

gsubfn("([^_]+_[^_]+)(.).*", f, mystring, backref=2)
# [1] "MODY_60.2"        "MODY_116.21"      "MODY_116.3"       "MODY_116.4"      
# [5] "MODY_116.4"       "MODY_116.5.ReCal" "MODY"            

In the function, f, x is the original string, y and z are the next matches. So, if z is not a "_", then it proceeds with the splitting by the alternative string.


With the stringr package:

str_extract(mystring, '.*?_.*?(?=_)|^.*?_.*(?=\\.ReCal)')
[1] "MODY_60.2" "MODY_116.21" "MODY_116.3" "MODY_116.4"

It also works with more than two delimiters.


Perl/PCRE has the branch reset feature that lets you reuse a group number when you have capturing groups in different alternatives, and is considered as one capturing group.

IMO, this feature is elegant when you want to supply different alternatives.

x <- c('MODY_60.2.ReCal.sort.bam', 'MODY_116.21_C4U.ReCal.sort.bam', 
       'MODY_116.3_C2RX-1-10.ReCal.sort.bam', 'MODY_116.4.ReCal.sort.bam',
       'MODY_116.4_asdfsadf_1212_asfsdf', 'MODY_116.5.ReCal_asdfsadf_1212_asfsdf', 'MODY')

sub('^(?|([^_]*_[^_]*)_.*|(.*)\\.ReCal.*)$', '\\1', x, perl=T)
# [1] "MODY_60.2"        "MODY_116.21"      "MODY_116.3"       "MODY_116.4"      
# [5] "MODY_116.4"       "MODY_116.5.ReCal" "MODY"