Regular Expression for finding double characters in Bash

This really is two questions, and should have been split up. But since the answers are relatively simple, I will put them here. These answers are for GNU grep specifically.

a) egrep is the same as grep -E. Both indicate that "Extended Regular Expressions" should be used instead of grep's default Basic Regular Expressions. Plain grep requires the backslashes because, in Basic Regular Expressions, these operators are only special when backslashed.

From the man page:

Basic vs Extended Regular Expressions

In basic regular expressions the meta-characters ?, +, {, |, (, and ) lose their special meaning; instead use the backslashed versions \?, \+, \{, \|, \(, and \).

See the man page for additional details about historical conventions and portability.
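To make the quoted passage concrete, here is a small sketch that runs the same pattern in both flavours. The sample input is made up, and the backslashed \+ in the basic form assumes GNU grep.

```shell
# Made-up sample input, one entry per line.
input=$(printf '%s\n' 'ab' 'abab' 'a+b')

# ERE: ( ) and + are special without backslashes.
ere_out=$(printf '%s\n' "$input" | grep -E '(ab)+')

# BRE (GNU grep assumed for \+): the same operators need backslashes.
bre_out=$(printf '%s\n' "$input" | grep '\(ab\)\+')

# BRE without backslashes: + is an ordinary character here.
literal_out=$(printf '%s\n' "$input" | grep 'a+b')

echo "$ere_out"      # ab and abab, one per line
echo "$bre_out"      # same two lines
echo "$literal_out"  # a+b
```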

b) Use egrep '(.)\1{N}' and replace N with the number of characters you wish to match minus one (since the dot matches the first one). So if you want to match a character repeated four times, use egrep '(.)\1{3}'.
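As a rough sketch of the "N minus one" rule, the arithmetic can be wrapped in a small function. The name find_repeats is hypothetical, and back references with -E are assumed to be supported (they are in GNU grep).

```shell
# Hypothetical helper: read lines from standard input and print those that
# contain some character repeated at least $1 times in a row.
find_repeats() {
    n=$1
    # The group matches the first character, so \1 must repeat n-1 times.
    grep -E "(.)\\1{$((n - 1))}"
}

printf '%s\n' 'aaab' 'aaaab' 'abab' | find_repeats 4   # prints aaaab
```

Calling find_repeats 3 on the same input would print aaab and aaaab, because a run of four also contains a run of three.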


This would look for 2 or more occurrences of the same character:

grep -E '(.)\1+' file

If your grep has the -o option, this would print each match on a new line:

grep -Eo '(.)\1+' file

To find a character repeated 3 times in a row (lines with longer runs also match, since they contain a run of 3):

grep -E '(.)\1{2}' file

Or 3 or more:

grep -E '(.)\1{2,}' file

etc.
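A quick way to see what these commands do is to feed them a few made-up lines. This is only a sketch; it assumes a grep that supports -o and back references with -E (GNU grep does).

```shell
# Made-up sample text.
sample=$(printf '%s\n' 'bookkeeper' 'aaa' 'aaaa' 'abc')

# -o prints each match on its own line instead of the whole input line.
pairs=$(printf '%s\n' "$sample" | grep -Eo '(.)\1+')

# {2} asks for the group plus two copies. A run of four contains a run of
# three, so the line aaaa is selected as well.
triples=$(printf '%s\n' "$sample" | grep -E '(.)\1{2}')

echo "$pairs"     # oo, kk, ee, aaa, aaaa (one per line)
echo "$triples"   # aaa, aaaa (one per line)
```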


edit

Actually @stephane_chazelas is right about back references and -E. I had forgotten about that. I tried it in BSD grep and GNU grep and it works there, but it is not supported in some other greps. There you would need to use one of the versions below.

Regular grep versions:

grep '\(.\)\1\{1,\}' file

grep -o '\(.\)\1\{1,\}' file

grep '\(.\)\1\{2\}' file

grep '\(.\)\1\{2,\}' file

The -o option is also not standard grep, BTW (probably if your grep understands -o, it can also do the back reference).


Note: grep -E '(.)\1{2,}' file and grep '\(.\)\1\{2\}' file are wrong, as alexis indicated, and should be ignored.


First, thank you all for your supporting comments and suggestions. As it turns out I was already quite close to the answer.

The Main Issue was about:

Is there a simple way to look for n occurrences of the same character, e.g. aa, tttttt

Short answer:

The following [variations of] commands will match a repeated at least once, with no upper limit:

grep 'a\{1,\}'

grep -E '(a){1,}'

egrep 'a{1,}'

or, with GNU Regular Expressions available, grep 'a\+'


The number of repetitions is set inside the curly brackets, using the pattern {min,max}: {n} repeats exactly n times, {n,} repeats at least n times, and {n,m} repeats at least n but at most m times.
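Here is a minimal sketch of the three interval forms on made-up input. I use -x (match the whole line) so that the bounds become visible. Without it, any line that merely contains enough repetitions would be selected.

```shell
# Made-up input: runs of the letter a, one per line.
runs=$(printf '%s\n' 'a' 'aa' 'aaa' 'aaaa')

# -x requires the pattern to match the entire line.
exactly_two=$(printf '%s\n' "$runs" | grep -Ex 'a{2}')
two_or_more=$(printf '%s\n' "$runs" | grep -Ex 'a{2,}')
two_to_three=$(printf '%s\n' "$runs" | grep -Ex 'a{2,3}')

echo "$exactly_two"    # aa
echo "$two_or_more"    # aa, aaa, aaaa (one per line)
echo "$two_to_three"   # aa, aaa (one per line)
```

The quotes matter: unquoted, bash would brace-expand a{2,} into the two words a2 and a before grep ever sees the pattern.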

This raised the secondary issue:

Is the necessity of setting backslashes bound to the command I use?

Short answer: Yes, the use of backslashes depends on whether one uses grep or egrep

  • grep: backslash activates metacharacters [uses Basic Regular Expressions]
  • egrep: backslash de-activates metacharacters [uses Extended Regular Expressions]
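The activate/de-activate rule can be checked with the curly brackets themselves. This is a sketch on made-up input: one line holds two a characters, the other holds the literal text a{2}.

```shell
# Made-up input: a real repetition and a literal brace expression.
input=$(printf '%s\n' 'aa' 'a{2}')

# grep (BRE): the backslash switches the interval operator on.
bre_special=$(printf '%s\n' "$input" | grep 'a\{2\}')
# grep (BRE): without the backslash the braces are plain characters.
bre_literal=$(printf '%s\n' "$input" | grep 'a{2}')

# grep -E (ERE): the braces are special as they are.
ere_special=$(printf '%s\n' "$input" | grep -E 'a{2}')
# grep -E (ERE): the backslash switches them off again.
ere_literal=$(printf '%s\n' "$input" | grep -E 'a\{2\}')

echo "$bre_special $ere_special"   # aa aa
echo "$bre_literal $ere_literal"   # a{2} a{2}
```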

That is the short answer. For those who run into comparable issues, I have added my basic summary of what one seemingly has to be aware of when working with grep and egrep.




Basic, Extended, and GNU Regular Expressions

Basic Regular Expressions

Used in the grep, ed and sed commands

The Basic Regular Expressions feature set:

  • Some metacharacters, e.g. { } ( ), are activated through a backslash. If there is no backslash they will be taken as (part of the) search term.
  • . [ ] * ^ and $ are special without a backslash
  • No shorthand characters [\b, \s, etc.] in the standard set (GNU grep adds \b, \s, \w, \< and \>)

GNU Basic Regular Expressions add to these

  • \? repeat a character zero or one time (cc\? matches c and cc) and is an alternative for \{0,1\}
  • \+ repeat a character at least one time (c\+ matches c, cc, cccccccc etc.) and is an alternative for \{1,\}

  • \| is supported (e.g. grep 'a\|b' will look for a or b)
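The equivalences in this list can be checked side by side. The sketch below assumes GNU grep (these backslashed operators are extensions) and uses a made-up word list.

```shell
# Made-up word list.
words=$(printf '%s\n' 'color' 'colour' 'colouur' 'cat' 'dog' 'bird')

# \? versus \{0,1\}: the u may appear zero or one time.
opt_short=$(printf '%s\n' "$words" | grep -x 'colou\?r')
opt_long=$(printf '%s\n' "$words" | grep -x 'colou\{0,1\}r')

# \+ versus \{1,\}: the u must appear at least once.
plus_short=$(printf '%s\n' "$words" | grep -x 'colou\+r')
plus_long=$(printf '%s\n' "$words" | grep -x 'colou\{1,\}r')

# \| alternation. The quotes keep the shell from removing the backslash.
either=$(printf '%s\n' "$words" | grep 'cat\|dog')

echo "$opt_short"    # color, colour (one per line)
echo "$plus_short"   # colour, colouur (one per line)
echo "$either"       # cat, dog (one per line)
```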

grep -E enables the command to use the whole set of the Extended Regular Expressions:


Extended Regular Expressions [ERE]

Used in egrep and awk; it is the Basic Set plus quite a few features.

  • Metacharacters are deactivated through a backslash
  • No back references in the standard (GNU and BSD grep accept them with -E as an extension)
  • Otherwise: a lot of the magic Regular Expressions can usually do for one

GNU Extended Regular Expressions

add the following features

  • shorthand classes
  • quantifiers
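To tie the shorthand classes back to the original question, here is a small sketch that looks for doubled characters with and without \w. It assumes GNU grep, where \w matches letters, digits and the underscore. The sample text is made up and contains a double space.

```shell
# Made-up text with a double space between the words.
text='foo  bookkeeper'

# Any doubled character, including the double space.
any_double=$(printf '%s\n' "$text" | grep -Eo '(.)\1')

# Only doubled word characters: \w excludes the spaces.
word_double=$(printf '%s\n' "$text" | grep -Eo '(\w)\1')

printf '%s\n' "$any_double" | grep -c ''    # 5
printf '%s\n' "$word_double" | grep -c ''   # 4
```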

The two links will direct one to regular-expressions.info which, in addition to the awesome support I've got here, really helped me a lot.