Is it safe to let a user type a regex as a search input?

In terms of code-execution risk, I would compare accepting user-supplied regular expressions to parsing most kinds of structured user input, such as date strings or Markdown. Regular expressions are much more complex than date strings or Markdown (although safely producing HTML from untrusted Markdown has its own risks) and so leave more room for exploitation, but the basic principle is the same: exploitation involves finding unexpected side effects of the parsing, compilation, or matching process.

Most regex libraries are mature, and in many languages they are part of the standard library, which is a pretty good (but not certain) indicator that they're free of major issues leading to code execution.
That is to say, accepting user regex does increase your attack surface, but it's not unreasonable to make the measured decision to accept that relatively minor risk.

Denial of service attacks are a little trickier. I think most regular expression libraries are designed with performance in mind but do not count mitigation of intentionally slow input among their core design goals. Whether accepting user-supplied regular expressions is appropriate from the DoS perspective therefore depends more on the library.
For example, the .NET regex library accepts a timeout, which could be used to mitigate DoS attacks.
RE2 guarantees matching time linear in the size of the input, which may be acceptable if you know your search corpus falls within some reasonable size limit.
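
Python's standard `re` module, for instance, has no timeout parameter, so one way to approximate a .NET-style timeout is to run the match in a child process and kill it if it takes too long. This is only a minimal sketch under some assumptions: the pattern and text are small enough to pass as command-line arguments, and the cost of starting a process per search is acceptable. `search_with_timeout` and its default timeout are hypothetical names chosen for illustration, not part of any library.

```python
import subprocess
import sys

# Script run in the child process: argv[1] is the pattern, argv[2] the text.
_CHILD = (
    "import re, sys\n"
    "print(1 if re.search(sys.argv[1], sys.argv[2]) else 0)\n"
)


def search_with_timeout(pattern, text, timeout_s=1.0):
    """Return True/False for a completed search, or None on timeout or error."""
    try:
        result = subprocess.run(
            [sys.executable, "-c", _CHILD, pattern, text],
            capture_output=True,
            text=True,
            timeout=timeout_s,  # the child is killed when this expires
        )
    except subprocess.TimeoutExpired:
        return None
    if result.returncode != 0:
        # The child raised, for example because the pattern was invalid.
        return None
    return result.stdout.strip() == "1"


if __name__ == "__main__":
    # A classic catastrophic-backtracking pattern against a non-matching input.
    print(search_with_timeout(r"(a+)+$", "a" * 40 + "!", timeout_s=0.5))
    print(search_with_timeout(r"b+", "abbb", timeout_s=5.0))
```

The first call should give up after roughly half a second instead of tying up a worker indefinitely. A production version would probably reuse a pool of workers rather than spawn a process per query.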

In situations where availability is absolutely critical, or where you're trying to minimize your attack surface as much as possible, it makes sense to avoid accepting user regex. Otherwise, I think it's a defensible practice.


The main threat in accepting regular expressions will be in your regex execution engine rather than accepting regex itself. I'd expect the threat to be very, very low in any well implemented engine. The engine shouldn't need access to any privileged system resources and should only need to run logic on input provided directly to the engine. This means that even if someone finds an exploit in the interpreter, the damage that can be done should be minimal.

Overall, all regex is designed to do is look for patterns within a value. As long as proper security is followed on the values you check against, there is no reason the engine itself should have any access to modify values. I'd classify it as generally pretty safe.

That said, I'd also only provide it in situations where it makes reasonable sense to do so. Regex is complex and potentially time-consuming to run, and used in the wrong places it could have some undesirable impacts on an application outside of a security context, but in the right use case it is hugely powerful and immensely valuable. (I'm a software architect who regularly refactors hundreds of thousands of lines of code using regex.)
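
One way to offer regex only where it makes sense is to treat search input as plain text by default and compile it as a regex only when the user explicitly opts in. The following is a rough sketch of that idea; `build_matcher`, `use_regex`, and `MAX_PATTERN_LENGTH` are hypothetical names, and the length cap is an arbitrary illustrative value rather than a recommendation.

```python
import re

MAX_PATTERN_LENGTH = 200  # arbitrary illustrative cap (assumption)


def build_matcher(query, use_regex=False):
    """Compile a search query; plain text unless the user opted into regex.

    Returns a compiled pattern, or None if the regex is too long or invalid.
    """
    if not use_regex:
        # re.escape makes every character of the query match literally.
        return re.compile(re.escape(query), re.IGNORECASE)
    if len(query) > MAX_PATTERN_LENGTH:
        return None
    try:
        return re.compile(query)
    except re.error:
        # Malformed pattern such as an unbalanced parenthesis.
        return None
```

With this, `build_matcher("a.c")` matches only the literal text `a.c`, while `build_matcher("a.c", use_regex=True)` also matches `abc`. Note that the length cap does nothing about slow patterns; it only keeps obviously oversized input away from the compiler.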


As the other answers have pointed out, the most likely attack vector would be the regex engine.

While you would assume that these engines are quite mature, robust, and thoroughly tested, vulnerabilities have been found in the past:

CVE-2010-1792 Arbitrary Code Execution in Apple Safari and iOS. Quote from the Patch notes:

A memory corruption issue exists in WebKit's handling of regular expressions. Visiting a maliciously crafted website may lead to an unexpected application termination or arbitrary code execution.

But of course, the argument of a possibly flawed library holds for everything - even user-provided JPEG files.

The other aspect, albeit not inherently technical, would be the (.+) case you mentioned: Should the product allow arbitrary data retrieval?
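
To illustrate that last point: if access control is applied before matching, a catch-all pattern like `(.+)` can only return what the user was already allowed to see. This is a toy sketch; the `Record` type, the `owner` field, and `search_visible` are all hypothetical stand-ins for whatever authorization model the product actually has.

```python
import re
from dataclasses import dataclass


@dataclass
class Record:
    owner: str  # hypothetical stand-in for a real permission model
    text: str


def search_visible(records, user, pattern):
    """Apply a user-supplied pattern only to records the user may read."""
    compiled = re.compile(pattern)
    # Filter by permission first, so the pattern cannot widen access.
    visible = [r for r in records if r.owner == user]
    return [r.text for r in visible if compiled.search(r.text)]


records = [
    Record("alice", "alice's draft"),
    Record("bob", "bob's secret"),
]

if __name__ == "__main__":
    # Even a match-everything pattern only sees alice's own record.
    print(search_visible(records, "alice", r"(.+)"))
```

Here the regex is just a filter over an already-authorized result set, so whether `(.+)` is acceptable becomes a product question rather than a security one.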