Match and remove duplicated characters: Replace multiple (3+) non-consecutive occurrences

A non-regex R solution: split the string into single characters, replace every element of the resulting vector whose rowid is >= 3 (*) with '-', then paste it back together.

x <- '111aabbccxccybbzaa1'

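# split into a character vector, one element per character
xsplit <- strsplit(x, '')[[1]]
# mask the third and later occurrences of each character
xsplit[data.table::rowid(xsplit) >= 3] <- '-'
paste(xsplit, collapse = '')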

# [1] "11-aabbccx--y--z---"

(*) rowid(x) is an integer vector in which each element is the number of times the corresponding value of x has occurred so far, counting the current position. So if the last element of x is 1, and that is the fourth time 1 has occurred in x, the last element of rowid(x) is 4.
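
For readers who do not use data.table, the running count that rowid produces can be sketched in Python. This is only an illustrative analogue, not data.table's implementation, and the helper name running_count is made up for this example.

```python
from collections import defaultdict

def running_count(values):
    """For each element, return how many times its value has occurred so far (1-based)."""
    seen = defaultdict(int)
    counts = []
    for v in values:
        seen[v] += 1
        counts.append(seen[v])
    return counts

x = '111aabbccxccybbzaa1'
counts = running_count(x)
# keep the first two occurrences of each character, mask the rest
result = ''.join('-' if n >= 3 else ch for ch, n in zip(x, counts))
print(result)
```

Because the count is kept per value in one pass, this visits each character once.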


You can also accomplish this in Python without regex:

s = '111aabbccxccybbzaa1'

for u in set(s):
    for i in [i for i in range(len(s)) if s[i]==u][2:]:
        s = s[:i]+'-'+s[i+1:]

print(s)

Result:

11-aabbccx--y--z---

How this works:

  1. for u in set(s) iterates over the set of unique characters in the string: {'c','a','b','y','1','z','x'} (a set is unordered, so the iteration order may vary).
  2. for i in ... loops over the indices gathered in step 3.
  3. [i for i in range(len(s)) if s[i]==u][2:] collects every index whose character equals u (from step 1), then slices that list from index 2 onward, dropping the first two occurrences. If u occurs fewer than three times, the result is empty.
  4. s = s[:i]+'-'+s[i+1:] rebuilds the string from the substring before index i, a -, and the substring after index i, replacing the original character. The replacement has the same length, so the remaining indices stay valid.
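
To make steps 1 to 4 concrete, here is a small trace for a single character, u = 'c'. It is a sketch for illustration only: the names positions, extra and after_c are made up here, and it mutates a list of characters instead of rebuilding the string on every replacement.

```python
s = '111aabbccxccybbzaa1'
u = 'c'

# step 3: every index where the character equals u
positions = [i for i in range(len(s)) if s[i] == u]
# drop the first two occurrences, keep the third and later ones
extra = positions[2:]

# step 4, done on a list so each replacement is an in-place assignment
chars = list(s)
for i in extra:
    chars[i] = '-'
after_c = ''.join(chars)

print(positions, extra, after_c)
```

Repeating this for every character in set(s) gives the full result shown above.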

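An option with gsubfn in R. When a proto object is passed as the replacement, gsubfn keeps a running count of matches in it, so fun can return '-' from the third match onward: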

library(gsubfn)
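# fun is called for each match; gsubfn maintains count, the match number
p <- proto(fun = function(this, x) if (count >= 3) '-' else x)
# handle one character at a time so the count restarts for each character
for (i in c(0:9, letters)) x <- gsubfn(i, p, x)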
x
#[1] "11-aabbccx--y--z---"

data

x <- '111aabbccxccybbzaa1'
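
The same idea, a replacement function that counts matches, can be sketched in Python with re.sub and a callable. This is an analogue written for comparison, not a port of gsubfn; the helper name mask_after_two is made up, and, like the loop above, it assumes the input contains only digits and lowercase letters.

```python
import re
import string

def mask_after_two(text, char):
    """Replace the third and later occurrences of char in text with '-'."""
    count = 0

    def repl(match):
        nonlocal count
        count += 1  # match number for this character, starting at 1
        return '-' if count >= 3 else match.group(0)

    return re.sub(re.escape(char), repl, text)

x = '111aabbccxccybbzaa1'
# one pass per candidate character, so each character gets its own count
for ch in string.digits + string.ascii_lowercase:
    x = mask_after_two(x, ch)
print(x)
```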