How to remove consecutive identical words from a string in Python

Short regex magic:

import re

mystring = "my friend's new new new new and old old cats are running running in the street"
res = re.sub(r'\b(\w+\s*)\1{1,}', r'\1', mystring)
print(res)

Regex pattern details:

  • \b - word boundary
  • (\w+\s*) - one or more word characters (\w+) followed by any amount of whitespace (\s*), enclosed in a capturing group (...)
  • \1{1,} - a backreference to the 1st captured group, repeated one or more times ({1,}, equivalent to +)

The output:

my friend's new and old cats are running in the street
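
One caveat with the pattern above: because the whitespace sits inside the captured group, a repeated word at the very end of the string (with no trailing space) is not collapsed, and a doubled letter at the start of a word can match as a "repeat" (for example "llama" becomes "lama"). A minimal sketch of a stricter variant, assuming you only want to collapse whole, whitespace-separated words; the helper name dedupe_words_regex is made up for illustration:

```python
import re

def dedupe_words_regex(text):
    # Hypothetical helper, not part of the original answer.
    # (\w+) captures one whole word; (?:\s+\1\b)+ then matches one or more
    # whitespace-separated repeats of that same whole word.
    return re.sub(r'\b(\w+)(?:\s+\1\b)+', r'\1', text)

print(dedupe_words_regex("it is is running running"))  # it is running
print(dedupe_words_regex("a llama in the theater"))    # a llama in the theater
```

The trailing \b after the backreference is what stops "the theater" from being treated as "the the".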

Using itertools.groupby:

import itertools

>>> ' '.join(k for k, _ in itertools.groupby(mystring.split()))
"my friend's new and old cats are running in the street"

  • mystring.split() splits mystring into a list of words on whitespace.
  • itertools.groupby groups consecutive equal words; k is the key (the word) of each group.
  • Using a generator expression, we take just the group key and ignore the group itself.
  • We join the keys using a space.
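
To see what groupby is doing here, this small sketch prints each run of equal neighbours with its length before joining the keys. The sample sentence and variable names are just for illustration:

```python
import itertools

words = "new new new and old old cats".split()

# Each item from groupby is (key, iterator over one run of equal neighbours).
runs = [(key, len(list(group))) for key, group in itertools.groupby(words)]
print(runs)
# [('new', 3), ('and', 1), ('old', 2), ('cats', 1)]

# Keeping only the keys drops the repeats inside each run.
deduped = ' '.join(key for key, _ in itertools.groupby(words))
print(deduped)
# new and old cats
```

Note that split() followed by ' '.join(...) also collapses any run of whitespace (tabs, newlines, multiple spaces) into a single space, which the regex approach does not do.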

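The running time is linear in the length of the input string: the string is split once, each word is compared only with its predecessor, and the result is joined once.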
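
If it helps to see why a single pass is enough, here is a rough hand-written equivalent of the groupby version. The function name dedupe_words_loop is hypothetical, and this is only a sketch of the same idea, not a replacement for it:

```python
def dedupe_words_loop(text):
    # Hypothetical helper: keep a word only if it differs from the
    # previously kept word, so each word is looked at exactly once.
    result = []
    for word in text.split():
        if not result or result[-1] != word:
            result.append(word)
    return ' '.join(result)

mystring = "my friend's new new new new and old old cats are running running in the street"
print(dedupe_words_loop(mystring))
# my friend's new and old cats are running in the street
```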
