Count the number of pattern matches in a string

You can use gregexpr to find the positions of "CG" in vec. Because gregexpr returns -1 when there is no match, we compare the result against -1 before counting; sum then adds up the TRUE values, which gives the number of matches.

> vec <- "AAAAAAACGAAAAAACGAAADGCGEDCG"
> sum(gregexpr("CG", vec)[[1]] != -1)
[1] 4
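
To see why the comparison with -1 is needed, here is a small sketch using a made-up input ("AAAA") that contains no "CG". With no match, gregexpr still returns a vector of length one, so length() alone would report one match:

```r
# "AAAA" is an illustrative string with no "CG" in it
no_match <- gregexpr("CG", "AAAA")[[1]]

as.integer(no_match)   # the "no match" marker
# [1] -1
length(no_match)       # would be miscounted as one match
# [1] 1
sum(no_match != -1)    # correct count
# [1] 0
```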

If you have a vector of strings, you can use sapply:

> vec <- c("ACACACACA", "GGAGGAGGAG", "AACAACAACAAC", "GGCCCGCCGC", "TTTTGTT", "AGAGAGA")
> sapply(gregexpr("CG", vec), function(x) sum(x != -1))
[1] 0 0 0 2 0 0

If you have a list of strings, you can use unlist(vec) and then use the solution above.
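
A minimal sketch of the list case, assuming every list element is a single string (the list seqs below is made up for illustration):

```r
# Hypothetical list of strings; each element is one character string
seqs <- list("GGCCCGCCGC", "TTTTGTT", "ACGCGCG")

# Flatten the list to a character vector, then count as before
counts <- sapply(gregexpr("CG", unlist(seqs)), function(x) sum(x != -1))
counts
# [1] 2 0 3
```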


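The Bioconductor package Biostrings has a matchPattern function. It returns the matches themselves, so take the length of the result (or use countPattern) to get the count:

countGC <- length(matchPattern("GC", DNAString_object))

Note that DNAString_object is a single sequence (a DNAString or AAString), for example one element of the FASTA set read in using the Biostrings function readDNAStringSet or readAAStringSet. To count across a whole set, use vcountPattern.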
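
If only the count is needed, Biostrings also provides countPattern for a single sequence and vcountPattern for a set. The following is a sketch that assumes Biostrings is installed from Bioconductor; the sequences are made up here rather than read from a FASTA file:

```r
library(Biostrings)

# Illustrative single sequence (only valid DNA letters)
dna <- DNAString("AAACGAAACGTTCG")
countPattern("CG", dna)
# [1] 3

# Illustrative set, standing in for the result of readDNAStringSet()
dna_set <- DNAStringSet(c("GGCCCGCCGC", "TTTTGTT"))
vcountPattern("CG", dna_set)
# [1] 2 0
```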


Use str_count from stringr. It's simple to remember and read, though not a base function.

library(stringr)
str_count("AAAAAAACGAAAAAACGAAADGCGEDCG", "CG")
# [1] 4
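
str_count is vectorised over its input, so the vector case needs no sapply. A short sketch, reusing the vector from the gregexpr answer above; the fixed() example with "a.b.c" is an added illustration for patterns that contain regex metacharacters:

```r
library(stringr)

vec <- c("ACACACACA", "GGAGGAGGAG", "AACAACAACAAC", "GGCCCGCCGC", "TTTTGTT", "AGAGAGA")
str_count(vec, "CG")
# [1] 0 0 0 2 0 0

# Wrap the pattern in fixed() to match it literally rather than as a regex
str_count("a.b.c", fixed("."))
# [1] 2
```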

Tags: String, Regex, R