How to do an accent insensitive grep?

You are looking for a whole bunch of POSIX regex equivalence classes:

14.3.6.2 Equivalence Class Operators ([= … =])

Regex recognizes equivalence class expressions inside lists. A equivalence class expression is a set of collating elements which all belong to the same equivalence class. You form an equivalence class expression by putting a collating element between an open-equivalence-class operator and a close-equivalence-class operator. [= represents the open-equivalence-class operator and =] represents the close-equivalence-class operator. For example, if a and A were an equivalence class, then both [[=a=]] and [[=A=]] would match both a and A. If the collating element in an equivalence class expression isn’t part of an equivalence class, then the matcher considers the equivalence class expression to be a collating symbol.

I'm using carets on the line below each output to indicate what is actually colored. I also tweaked the test string to illustrate a point about case.

$ echo "I match àei but also äēì and possibly æi" | grep '[[=a=]][[=e=]][[=i=]]'
I match àei but also äēì and possibly æi
        ^^^          ^^^

This matches all words like aei. The fact that it does not match æi should stand as a reminder that you're beholden to whatever mapping exists in the regex library you're using (presumably gnulib, which is what I linked and quoted), though I figure it's quite likely that digraphs are beyond the reach of even the best equivalence class map.

You should not expect equivalence classes to be portable as they are too arcane.

Taking this a step further, if you want ONLY accented characters, things get far more complicated. Here I've changed your request for aei into [aei].

$ echo "I match àei but also äēì and possibly æi" | grep '[[=a=][=e=][=i=]]'
I match àei but also äēì and possibly æi
^  ^    ^^^     ^    ^^^ ^       ^     ^

Cleaning this up to avoid non-accent matches would require both equivalence classes and look-ahead/look-behind, and while BRE (basic POSIX regex) and ERE (extended POSIX regex) support the former, they both lack the latter. Libpcre (the C library for perl-compatible regex that grep -P and most others use) and perl support the latter but lack the former:

Try #1: grep with libpcre: failure

$ echo "I match àei but also äēì and possibly æi" \
    | grep -P '[[=a=][=e=][=i=]](?<![aei])'
grep: POSIX collating elements are not supported

Try #2: perl itself: failure

$ echo "I match àei but also äēì and possibly æi" \
    | perl -ne 'print if /[[=a=][=e=][=i=]](?<![aei])/'
POSIX syntax [= =] is reserved for future extensions in regex; marked by <-- HERE in m/[[=a=][=e= <-- HERE ][=i=]](?<![aei])/ at -e line 1.

Try #3: python (which has its own Perl-style regex implementation): (silent) failure

$ echo "I match àei but also äēì and possibly æi" \
    | python -c 'import re, sys; print(re.findall(r"[[=a=][=e=][=i=]]", sys.stdin.read()))'
[]

Wow, a regex feature that PCRE, python, and even perl don't support! There aren't too many of those. (Never mind the complaint being on the second equivalence class; it still complains given just /[[=a=]]/.) This is further evidence that equivalence classes are arcane.

In fact, it appears that there aren't any PCRE libraries capable of equivalence classes; the section on equivalence classes at regular-expressions.info claims only the regex libraries implementing the POSIX standard actually have this support. GNU grep gets closest since it can do BRE, ERE, and PCRE, but it can't combine them.

So we'll do it in two parts.

Try #4: disgusting trickery: success

$ echo "I match àei but also äēì and possibly æi" \
    | grep --color=always '[[=a=][=e=][=i=]]' \
    | perl -pe "s/\e\[[0-9;]*m\e\[K(?i)([aei])/\$1/g"
I match àei but also äēì and possibly æi
        ^            ^^^

Code walk:

grep forces color on so that perl can key on the color codes to note the matches
perl's s/// command matches the color code (\e…\e\[K) then the non-accented letters that we want to remove from the final results, then it replaces all of that with the (uncolored) letters (if that's insufficient, see my guide to removing all ANSI escape sequences)
Anything after (?i) in the perl regex is case-insensitive, which is needed because [[=i=]] also matches I
perl -p prints each line of its input upon completion of its -e execution
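
To make the substitution in the code walk easier to follow, here is a rough Python sketch of the same idea. It is not part of the original pipeline. The START and END strings are my assumption of what GNU grep emits around a match with its default colors; check your own output with `cat -v` if it matters.

```python
import re

# Any SGR color sequence, then the "erase in line" code, then one
# unaccented letter (either case, since [[=i=]] also matches "I").
UNCOLOR = re.compile(r"\x1b\[[0-9;]*m\x1b\[K([aeiAEI])")


def uncolor_plain(colored):
    """Drop the start-of-match color code in front of unaccented a, e, or i."""
    return UNCOLOR.sub(r"\1", colored)


# Assumed grep color codes (default match color, then reset).
START, END = "\x1b[01;31m\x1b[K", "\x1b[m\x1b[K"
sample = f"{START}à{END}{START}e{END} {START}I{END}"
print(repr(uncolor_plain(sample)))
```

As with the perl version, only the leading color code is removed. The reset code after each letter stays in the output, which is harmless on a terminal.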

For more on BRE vs ERE vs PCRE and others, see this StackExchange regex post or the POSIX regexps at regular-expressions.info. For more on per-language differences (including libpcre vs python PCRE vs perl), look to tools at regular-expressions.info.

2019 update: GNU grep now uses $GREP_COLORS (which can look like ms=1;41), and it takes priority over the older $GREP_COLOR (which looks like 1;41). The new format is harder to extract, and juggling the two variables is awkward, so I modified the perl code in try #4 to seek out any SGR color code instead of keying on just the color that grep would add. See revision 2 of this answer for the previous code.
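
To illustrate why matching any SGR code is simpler than working out grep's configured color, here is a small Python sketch. The helper name match_color is hypothetical, and the lookup is a simplification: it only reads the ms= field and ignores the other GREP_COLORS capabilities.

```python
import re

# Generic pattern: any SGR sequence, whatever color it encodes.
ANY_SGR = re.compile(r"\x1b\[[0-9;]*m")


def match_color(grep_colors="", grep_color=""):
    """Pick the match color: ms= from GREP_COLORS, else GREP_COLOR, else 01;31.

    Simplified on purpose; real grep understands more fields than ms=.
    """
    for item in grep_colors.split(":"):
        name, sep, value = item.partition("=")
        if name == "ms" and sep:
            return value
    return grep_color or "01;31"


for colors, color in [("ms=1;41", ""), ("", "1;41"), ("sl=:cx=", "")]:
    code = f"\x1b[{match_color(colors, color)}m"
    print(repr(code), bool(ANY_SGR.fullmatch(code)))
```

Whichever variable supplied the color, the generic pattern matches the resulting escape sequence, so the filter never needs to know the configured value.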

I cannot currently verify whether BSD grep, which is used by Apple Mac OS X, supports POSIX regex equivalence classes.
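
If equivalence classes turn out to be unavailable, the "accented variants only" selection can be approximated outside of grep. The following Python sketch swaps in a different technique: Unicode decomposition (NFD) instead of locale equivalence classes, so its notion of "same letter" may differ from your regex library's. The function name is hypothetical.

```python
import unicodedata


def accented_variants(text, bases="aei"):
    """Yield (index, char) for accented forms of the given base letters."""
    for index, ch in enumerate(text):
        decomposed = unicodedata.normalize("NFD", ch)
        # An accented letter decomposes into a base plus combining marks.
        has_mark = any(unicodedata.combining(c) for c in decomposed[1:])
        if has_mark and decomposed[0].lower() in bases:
            yield index, ch


line = "I match àei but also äēì and possibly æi"
print(list(accented_variants(line)))
```

With precomposed input, this should report the same four characters that the carets mark in try #4. The æ is skipped here too, because it has no decomposition into a base letter plus marks.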

I don't think this can be done in grep, unless you're willing to write a shell script that uses iconv and diff, which would be a bit visually different from what you're requesting.

Here is something very close to your request via a quick perl script:

#!/usr/bin/perl
# tgrep 0.1 Copyright 2014 by Adam Katz, GPL version 2 or later

use strict;
use warnings;
use open qw(:std :utf8);
use Text::Unidecode;

my $regex = shift or die "Missing pattern.\nUsage: tgrep PATTERN [FILE...]\n";

my $retval = 1;  # default to false (no hits)

while(<>) {
  my ($line, $hit) = ("", 0);
  while(/\G(\S*(?:\s+|$))/g){             # for each word (w/ trailing spaces)
    my $word = $1;
    if(unidecode($word) =~ qr/$regex/) {  # if there was a match
      $hit++;                             # note that fact
      $retval = 0;                        # final exit code will be 0 (true)
      $line .= "\e[1;31m$word\e[0;0m";    # display word in RED
    } else {
      $line .= $word;                     # display non-matching word normally
    }
  }
  print $line if $hit;                    # only display lines with matches
}

exit $retval;

Markdown doesn't allow me to make red text, so here's the output with the hits in quotes instead:

$ echo "match àei but also äēì and possibly æi" | tgrep aei
match "àei" but also "äēì" and possibly "æi"

This will highlight matching words rather than the actual match, which would be very difficult to do without making massive character classes and/or composing a piecemeal regex parser. Therefore, searching for the pattern "ae" instead of "aei" would produce the same results (in this case).

None of grep's flags are replicated in this toy example. I wanted to keep it simple.
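
For readers without Text::Unidecode, the same word-by-word idea can be sketched in Python with only the standard library. This is an approximation, not a port: it folds accents with NFD decomposition rather than unidecode, so æ is left alone and "æi" will not match aei here. The function names are hypothetical, and hits are wrapped in quotes instead of color.

```python
import re
import unicodedata


def fold(text):
    """NFD-decompose and drop combining marks (a rough stand-in for unidecode)."""
    return "".join(
        ch for ch in unicodedata.normalize("NFD", text)
        if not unicodedata.combining(ch)
    )


def mark_words(line, pattern, mark='"{}"'):
    """Wrap each whitespace-delimited word whose folded form matches pattern.

    Returns None when no word matches, mirroring tgrep's "print only hits".
    """
    regex = re.compile(pattern)
    parts = re.split(r"(\s+)", line)  # the capture group keeps whitespace runs
    hit = False
    for i, part in enumerate(parts):
        if part and not part.isspace() and regex.search(fold(part)):
            parts[i] = mark.format(part)
            hit = True
    return "".join(parts) if hit else None


print(mark_words("match àei but also äēì and possibly æi", "aei"))
```

Like the perl script, this marks whole words rather than the exact matched span.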
