How or why is using `.*?` better than `.*`?

Suppose I take a string like:

can cats eat plants?

The greedy c.*s matches can cats eat plants, i.e. everything except the trailing ?. It starts at the first c and, being greedy, keeps matching until the final occurrence of s.

The lazy c.*?s, by contrast, stops at the first occurrence of s it can reach, so it matches only can cats.

From the above example, you might be able to gather that:

"Greedy" means matching the longest possible string. "Lazy" means matching the shortest possible string. Adding a ? to a quantifier like *, +, ?, or {n,m} makes it lazy.
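As a rough illustration of the last point, the sketch below applies the lazy ? to + and {n,m} as well. It assumes GNU grep built with PCRE support (the -P option); the sample string aaaa and the helper first_match are made up for the example.

```shell
# Assumption: GNU grep with PCRE support (-P); 'aaaa' is an arbitrary sample.
text='aaaa'

# Hypothetical helper: print only the first match of a PCRE pattern in $text.
first_match() {
  printf '%s\n' "$text" | grep -P -o -- "$1" | head -n 1
}

plus_greedy=$(first_match 'a+')        # + takes as many a's as it can
plus_lazy=$(first_match 'a+?')         # +? stops after the one required a
range_greedy=$(first_match 'a{2,3}')   # {2,3} prefers the upper bound
range_lazy=$(first_match 'a{2,3}?')    # {2,3}? prefers the lower bound

printf '%s %s %s %s\n' "$plus_greedy" "$plus_lazy" "$range_greedy" "$range_lazy"
# prints: aaaa a aaa aa
```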


Ashok already pointed out the difference between .* and .*?, so I'll just provide some additional information.

grep (assuming the GNU version) supports 4 ways to match strings:

  • Fixed strings
  • Basic regular expressions (BRE)
  • Extended regular expressions (ERE)
  • Perl-compatible regular expressions (PCRE)

grep uses BRE by default.

BRE and ERE are documented in the Regular Expressions chapter of POSIX, and PCRE is documented on its official website. Please note that features and syntax may vary between implementations.

It's worth saying that neither BRE nor ERE supports laziness:

The behavior of multiple adjacent duplication symbols ( '+', '*', '?', and intervals) produces undefined results.

So if you want to use that feature, you'll need to use PCRE instead:

# BRE greedy
$ grep -o 'c.*s' <<< 'can cats eat plants?'
can cats eat plants

# BRE "lazy" attempt (in GNU BRE, \? means "optional", so this is still greedy)
$ grep -o 'c.*\?s' <<< 'can cats eat plants?'
can cats eat plants

# ERE greedy
$ grep -E -o 'c.*s' <<< 'can cats eat plants?'
can cats eat plants

# ERE "lazy" attempt (? after * is not a lazy modifier in ERE, so this is still greedy)
$ grep -E -o 'c.*?s' <<< 'can cats eat plants?'
can cats eat plants

# PCRE greedy
$ grep -P -o 'c.*s' <<< 'can cats eat plants?'
can cats eat plants

# PCRE lazy
$ grep -P -o 'c.*?s' <<< 'can cats eat plants?'
can cats
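If PCRE is unavailable, a negated bracket expression is a common way to approximate the lazy result in plain BRE or ERE. This is a sketch of a swapped-in technique, not true lazy matching: [^s]* can never run past an s, so the match has to end at the first one. It assumes a grep that supports -o (GNU and BSD grep do), and it only works this simply when the terminator is a single character.

```shell
# Swapped-in technique (not lazy matching): '[^s]*' cannot cross an 's',
# so the match must end at the first 's' after the leading 'c'.
input='can cats eat plants?'

bre_match=$(printf '%s\n' "$input" | grep -o 'c[^s]*s')      # BRE
ere_match=$(printf '%s\n' "$input" | grep -E -o 'c[^s]*s')   # ERE

printf '%s\n%s\n' "$bre_match" "$ere_match"
# prints "can cats" twice
```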

Edit 1

Could you please explain a little about .* vs .*? ?

  • .* is used to match the "longest"[1] pattern possible.

  • .*? is used to match the "shortest"[1] pattern possible.

In my experience, the second behavior is usually the one you want.

For example, let's say we have the following string and we only want to match the HTML tags[2], not the content between them:

<title>My webpage title</title>

Now compare .* vs .*?:

# Greedy
$ grep -P -o '<.*>' <<< '<title>My webpage title</title>'
<title>My webpage title</title>

# Lazy
$ grep -P -o '<.*?>' <<< '<title>My webpage title</title>'
<title>
</title>

1. The meaning of "longest" and "shortest" in a regex context is a bit tricky, as Kusalananda pointed out. Refer to the official documentation for more information.
2. It's not recommended to parse HTML with regex. This is just an example for educational purposes; don't use it in production.
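To make footnote 1 a little more concrete: a regex engine first picks the leftmost position where a match can start, and laziness only affects how far the match extends from there. That is why the lazy pattern earlier returns can cats rather than the shorter cats. A small sketch, assuming GNU grep with -P for the first command; the second pattern is just one illustrative way to keep a c out of the middle of the match.

```shell
input='can cats eat plants?'

# Assumption: GNU grep with PCRE support (-P).
# Lazy, but the match still starts at the leftmost 'c' (the one in "can").
lazy=$(printf '%s\n' "$input" | grep -P -o 'c.*?s')

# Forbidding 'c' and 's' inside the match rules out starting at "can",
# so the only match left is the shorter one.
shortest=$(printf '%s\n' "$input" | grep -E -o 'c[^cs]*s')

printf '%s\n%s\n' "$lazy" "$shortest"
# prints:
# can cats
# cats
```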