Why do regex engines allow / automatically attempt matching at the end of the input string?

I am giving this answer just to demonstrate why a regex engine would want to allow something to appear after the final $ anchor in a pattern. Suppose we need a regex that matches a string under the following rules:

  • starts with three digits
  • followed by one or more letters, digits, hyphens, or underscores
  • ends with a letter or a digit

We could write the following pattern:

^\d{3}[A-Za-z0-9\-_]*[A-Za-z0-9]$

But this is a bit bulky, because we have to use two similar character classes adjacent to each other. Instead, we could write the pattern as:

^\d{3}[A-Za-z0-9\-_]+$(?<!_|-)

or

^\d{3}[A-Za-z0-9\-_]+(?<!_|-)$

Here, we eliminated one of the character classes and instead used a negative lookbehind, placed either just after or just before the $ anchor, to assert that the final character is not an underscore or a hyphen.
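As a quick sanity check, here is a minimal sketch in Python's re module (an assumption; the answer does not name a flavor) that runs the three patterns above against a few sample strings. The sample strings and the verdicts helper are made up for illustration.

```python
import re

# The three patterns from the text, compiled with Python's re module.
PATTERNS = {
    "two_classes": re.compile(r"^\d{3}[A-Za-z0-9\-_]*[A-Za-z0-9]$"),
    "lookbehind_after": re.compile(r"^\d{3}[A-Za-z0-9\-_]+$(?<!_|-)"),
    "lookbehind_before": re.compile(r"^\d{3}[A-Za-z0-9\-_]+(?<!_|-)$"),
}

# Hypothetical sample inputs chosen for this sketch.
SAMPLES = ["123abc", "123a-b_c", "123abc-", "123_", "12abc", "123", "123a"]


def verdicts(sample):
    """Return, per pattern name, whether the sample matches."""
    return {name: bool(rx.search(sample)) for name, rx in PATTERNS.items()}


for s in SAMPLES:
    print(repr(s), verdicts(s))
```

On these samples the three patterns agree with each other, which is all the sketch is meant to show; other flavors may restrict what a lookbehind can contain.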

Other than a lookbehind, I cannot see why a regex engine would allow anything to appear after the $ anchor. My point is that an engine may allow a lookbehind after the $, and there are cases where doing so makes logical sense.


Recall several things:

  1. ^ and $ are zero-width assertions. ^ matches right at the logical start of the string (or right after each line ending in multiline mode, the m flag in most regex implementations). $ matches at the logical end of the string (or, in multiline mode, at the end of each line, BEFORE the end-of-line character or characters).

  2. .* can be a zero-length match, i.e. it can match nothing at all. A version that is only ever zero-length would be $(?:end of line){0} (which is useful as a comment, I guess...)

  3. . does not match \n (unless you have the s flag) but does match the \r in Windows CRLF line endings. So $.{1} can only ever match at a Windows line ending, and only in flavors where $ matches before the \r (but don't do that; use the literal \r\n instead.)
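The three points above can be checked with a short sketch. It uses Python's re module as one example flavor (an assumption; other engines place $ differently around \r\n), and the variable names are made up for illustration.

```python
import re

text = "123\r\n"  # one Windows-style line

# $ is zero-width: it reports positions, it does not consume characters.
dollar_positions = [m.start() for m in re.finditer(r"$", text, flags=re.M)]

# .* can succeed while consuming nothing at all.
empty_match = re.match(r".*", "")

# Without the s flag (re.S in Python), . rejects \n but accepts \r.
dot_on_lf = re.fullmatch(r".", "\n")
dot_on_cr = re.fullmatch(r".", "\r")
dot_on_lf_dotall = re.fullmatch(r".", "\n", flags=re.S)

print(dollar_positions)           # positions before "\n" and at the very end
print(repr(empty_match.group()))  # the empty string
print(dot_on_lf, dot_on_cr is not None, dot_on_lf_dotall is not None)
```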

There is no particular benefit; the behavior is simply a side effect of the rules above.

  1. The regex $ is useful.
  2. The regex .* is useful.
  3. The regexes ^(?a lookahead) and (?a lookbehind)$ are common and useful.
  4. The regexes (?a lookaround)^ and $(?a lookaround) are potentially useful.
  5. The regex $.* is not useful, and it is rare enough not to warrant an optimization that makes the engine stop looking in that edge case. Most regex engines do a decent job of parsing syntax (catching a missing brace or parenthesis, for example). For the engine to treat $.* as not useful, it would have to interpret the meaning of that regex differently from $(something else).
  6. What you get will be highly dependent on the regex flavor and on the status of the s and m flags.
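To illustrate the last point within a single flavor, here is a small sketch using Python's re module (an assumption; the input string is made up for this example). The same $.* replacement gives different results depending on the m flag.

```python
import re

text = "a\nb"  # two lines, no trailing newline

# Without the m flag, $ only matches at the end of the string here.
without_m = re.sub(r"$.*", "X", text)

# With the m flag (re.M), $ also matches before the embedded "\n".
with_m = re.sub(r"$.*", "X", text, flags=re.M)

print(repr(without_m))
print(repr(with_m))
```

In both cases .* contributes nothing, because the only character that can follow a $ position in this input is \n, which . does not match.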

As examples of replacements, consider the following Bash script and its output across some major regex flavors:

#!/bin/bash

echo "perl"
printf  "123\r\n" | perl -lnE 'say if s/$.*/X/mg' | od -c
echo "sed"
printf  "123\r\n" | sed -E 's/$.*/X/g' | od -c
echo "python"
printf  "123\r\n" | python -c "import re, sys; print(re.sub(r'$.*', 'X', sys.stdin.read(), flags=re.M))" | od -c
echo "awk"
printf  "123\r\n" | awk '{gsub(/$.*/,"X")};1' | od -c
echo "ruby"
printf  "123\r\n" | ruby -lne 's=$_.gsub(/$.*/,"X"); print s' | od -c

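Prints (note that in the Perl one-liner, $. is interpolated as Perl's input line number variable, so the pattern actually applied there is 1* rather than $.*, which explains why its output looks so different):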

perl
0000000    X   X   2   X   3   X  \r   X  \n                            
0000011
sed
0000000    1   2   3  \r   X  \n              
0000006
python
0000000    1   2   3  \r   X  \n   X  \n                                
0000010
awk
0000000    1   2   3  \r   X  \n                                        
0000006
ruby
0000000    1   2   3   X  \n                                            
0000005

What is the reason for using .* with the global modifier on? Either someone expects an empty string to be returned as a match, or he / she isn't aware of what the * quantifier does; otherwise the global modifier shouldn't be set. .* without g doesn't return two matches.
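A minimal sketch of that difference, assuming Python's re module, where re.finditer plays the role of the global modifier and re.search returns only the first match:

```python
import re

subject = "abc"

# "Global" search: every non-overlapping match, including the empty one
# found at the position after the last character.
all_spans = [m.span() for m in re.finditer(r".*", subject)]

# Single search: only the first (non-empty) match.
first_span = re.search(r".*", subject).span()

print(all_spans)
print(first_span)
```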

"it's not obvious what the benefit of this behavior is."

There shouldn't be a benefit. You are actually questioning the existence of zero-length matches. You are asking: why does a zero-length string exist?

We have three valid places where a zero-length string can exist:

  • Start of subject string
  • Between two characters
  • End of subject string
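These three kinds of position can be listed with an empty pattern. This is a small sketch assuming Python's re module; the two-character subject is made up for illustration.

```python
import re

subject = "ab"

# An empty pattern matches at every zero-length position:
# the start, between the two characters, and the end.
positions = [m.start() for m in re.finditer(r"", subject)]

print(positions)
```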

We should look for the reason behind, rather than the benefit of, that second zero-length match produced by .* with the g modifier (or by a function that searches for all occurrences). That zero-length position following the input string has some logical uses. The state diagram below was taken from debuggex for .*, but I added an epsilon on the direct transition from the start state to the accept state to illustrate the definition:

[Image: state diagram for .* with an epsilon transition from the start state to the accept state]
(source: pbrd.co)

That's a zero-length match (read more about epsilon transition).

This all relates to greediness and non-greediness. Without zero-length positions, a regex like .?? would have no meaning. It doesn't attempt the dot first; it skips it. To do so, it matches a zero-length string, which moves the current state to a temporarily acceptable state.

Without a zero-length position, .?? could never skip a character in the input string, and that would result in a whole new flavor.

The definition of greediness / laziness leads to zero-length matches.
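The difference can be seen in a short sketch, assuming Python's re module (the subject string is made up for illustration). The lazy .?? prefers the zero-length match and only consumes a character when the rest of the pattern forces it to.

```python
import re

subject = "ab"

# Greedy: tries the dot first, so it consumes "a".
greedy = re.match(r".?", subject).group()

# Lazy: tries the zero-length option first and succeeds with it.
lazy = re.match(r".??", subject).group()

# Lazy, but the following "b" forces it to backtrack and consume "a".
lazy_forced = re.match(r".??b", subject).group()

print(repr(greedy), repr(lazy), repr(lazy_forced))
```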