Explain sed expression that deletes lines with repeated fields

As a first step, you need to understand \(.\) correctly. In basic regular expressions, it is a capture group that matches any single character, and \1 must then match that same character again. The parentheses are not literal.


Now, for the extremely cool part! What does each element of the regex match in each case?

     Left  \(.\)  .*  \1  Right  Result
111        1      1   1          Deleted!
112        1          1   2      Deleted!
113        1          1   3      Deleted!
121        1      2   1          Deleted!
122  1     2          2          Deleted!
123        ?      ?   ?          NoMatch
131        1      3   1          Deleted!
132        ?      ?   ?          NoMatch
133  1     3          3          Deleted!      

On the 122, if not clear: since the expression is not anchored, the leading 1 is skipped (it falls into the "Left" slot), the middle 2 matches the capture group \(.\), and the last 2 matches the backreference \1. .* (which matches zero or more characters) takes whatever lies between the two, so in this case it shrinks to the empty string.

If you doubt it, try

echo 122 | grep --color=always '\(.\).*\1'

You will see that only the 22 has been colored.
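If colors are hard to read in your terminal, the same check can be sketched with grep -o, which prints only the matched part of each line. This assumes a grep that supports -o (GNU and BSD grep both do; it is not required by POSIX), and the four sample inputs are chosen only for illustration:

```shell
# Print only the text matched by the unanchored BRE for a few sample lines.
# 112 -> "11"  (.* is empty, the trailing 2 is left over on the right)
# 121 -> "121" (.* swallows the middle 2)
# 122 -> "22"  (the leading 1 is skipped on the left)
# 123 -> nothing printed, since no character repeats
matched=$(printf '%s\n' 112 121 122 123 | grep -o '\(.\).*\1')
printf '%s\n' "$matched"
```

Each printed fragment corresponds to the \(.\), .* and \1 columns of the table above, without the "Left" and "Right" parts.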


Compare it with the anchored version of the regex:

$ printf "%s\n" {1,2,3}{1,2,3}{1,2,3} | sort -u | sed '/^\(.\).*\1$/d'
112
113
122
123
132
133
...

Now there are no "Left" and "Right" slots:

     ^\(.\)  .*  \1$  Result
111  1       1   1    Deleted!
112  ?       ?   ?    NoMatch
113  ?       ?   ?    NoMatch
121  1       2   1    Deleted!
122  ?       ?   ?    NoMatch
123  ?       ?   ?    NoMatch
131  1       3   1    Deleted!
132  ?       ?   ?    NoMatch
133  ?       ?   ?    NoMatch

In this version the first digit must equal the last digit, so there are fewer matches.
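To see both tables side by side, here is a small sketch that runs the unanchored and the anchored expression over the same nine inputs used above and shows which lines survive. The variable names are only for illustration:

```shell
# The nine sample lines from the tables above.
inputs=$(printf '%s\n' 111 112 113 121 122 123 131 132 133)

# Unanchored: delete any line in which some character occurs twice.
unanchored=$(printf '%s\n' "$inputs" | sed '/\(.\).*\1/d')

# Anchored: delete only lines whose first and last characters are equal.
anchored=$(printf '%s\n' "$inputs" | sed '/^\(.\).*\1$/d')

# Join the surviving lines with spaces for compact display.
printf 'unanchored keeps: %s\n' "$(printf '%s\n' "$unanchored" | tr '\n' ' ')"
printf 'anchored keeps:   %s\n' "$(printf '%s\n' "$anchored" | tr '\n' ' ')"
```

The unanchored form keeps only 123 and 132 (the "NoMatch" rows of the first table), while the anchored form keeps the six lines marked "NoMatch" in the second table.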


sed uses basic regular expressions (BRE) by default, so \(.\) is a capture group containing any one character. The .* then skips over anything in between, and \1 matches whatever the group matched. If the whole lot can be made to match, then some character showed up at least twice: once for the group, and once for the backreference.

In fact, if I'm not mistaken, that wouldn't even work with standard extended regular expressions (ERE), since (for whatever reason) backreferences aren't supported in them. The standard only mentions backreferences under "BREs matching multiple characters", not under EREs, and the same thing with ERE doesn't work on my macOS (it takes the \1 as meaning a literal digit 1, so 321 is deleted instead of 122):

$ printf "%s\n" 122 321 | sed -E -e '/(.).*\1/d'
122

GNU tools do support backreferences in ERE, though.
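If the command has to behave the same on both kinds of systems, one option is simply to stay with BRE and leave out -E, since backreferences are part of BRE. A minimal sketch using the same two inputs:

```shell
# BRE form (no -E): \(.\) is the group and \1 is the backreference.
# 122 contains a repeated character and is deleted; 321 does not and is kept.
result=$(printf '%s\n' 122 321 | sed -e '/\(.\).*\1/d')
printf '%s\n' "$result"
```

This prints 321, the opposite of the ERE output shown above.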

(I don't think sort -u is necessary here, the combination of brace expansions should produce all combinations without duplicates.)