Check if simple regex matches string

Haskell, 203 bytes

Nobody had done this by implementing a small regex engine yet, and I felt like it had to be done. This obviously won't win, but I'm hoping it will inspire someone to write an even more golfed regex engine.

I've rewritten my solution to avoid directly parsing the regular expression into its AST. Instead, the parsing process constructs a function that is used to match a string against the input regex.

The main function is (&) :: String -> String -> Bool, which takes the string representation of a regex and a string to test and returns a Boolean. It calls p, which does most of the work of parsing the regex and matching the string.

The function p :: String -> ([String] -> [String], String) takes the string representation of a regex and returns a tuple. The first element is a function that maps a list of input strings to the list of all suffixes that can remain unmatched after the parsed regex has consumed a prefix of one of them. The second element is the part of the regex string that has not been parsed yet. The regex fully matches the string if the empty string is contained in the list of possible unmatched suffixes.

r&s=elem""$fst(p r)[s]
p(c:t)|c>'`'=t% \s->[t|h:t<-s,c==h]|c>'^'=t%id|(l,o:t)<-p t,(r,_:u)<-p t=u%last(r.l:[\s->r s++l s|o>'+'])
m#s=s++filter(`notElem`s)(m s)
('*':t)%m=t%until(\s->s==m#s)(m#)
s%m=(m,s)

Try it online!
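
To make the "list of unmatched suffixes" idea concrete, here is a minimal sketch that builds matchers by hand instead of parsing them from a regex string. The names Matcher, char, andThen, orElse and fullMatch are illustrative only and do not appear in the golfed code.

```haskell
-- A matcher maps every candidate string to the suffixes left after a match.
type Matcher = [String] -> [String]

-- Match one literal character at the head of each candidate.
char :: Char -> Matcher
char c ss = [t | h : t <- ss, h == c]

-- Concatenation: feed the suffixes left by the first matcher to the second.
andThen :: Matcher -> Matcher -> Matcher
andThen l r = r . l

-- Alternation: collect the suffixes produced by either branch.
orElse :: Matcher -> Matcher -> Matcher
orElse l r ss = l ss ++ r ss

-- The regex matches the whole string iff "" is among the leftover suffixes.
fullMatch :: Matcher -> String -> Bool
fullMatch m s = "" `elem` m [s]
```

With these definitions, fullMatch (char 'a' `andThen` (char 'b' `orElse` char 'c')) "ac" evaluates to True, while the same matcher applied to "a" gives False because no branch leaves an empty suffix.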

To get rid of one byte, I replaced import Data.List; m#s=nub$s++m s with m#s=s++filter(`notElem`s)(m s). These functions aren't equivalent if there are duplicate elements in either s or m s. The new function does, however, remove all elements from m s that already exist in s, so until still terminates once no new suffixes are discovered by the application of m.
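
The difference between the two definitions, and the fact that the fixed-point loop still terminates, can be checked with a small sketch. The names stepNub, stepFilter, star and matchA are assumptions made for this illustration; matchA stands in for the matcher of the single-character regex a.

```haskell
import Data.List (nub)

-- The original, import-requiring version: merge, then drop all duplicates.
stepNub :: ([String] -> [String]) -> [String] -> [String]
stepNub m s = nub (s ++ m s)

-- The shorter replacement: only drop elements of (m s) already present in s.
stepFilter :: ([String] -> [String]) -> [String] -> [String]
stepFilter m s = s ++ filter (`notElem` s) (m s)

-- Kleene star as a fixed point: apply the step until nothing new appears.
star :: ([String] -> [String]) -> [String] -> [String]
star m = until (\s -> s == stepFilter m s) (stepFilter m)

-- Hypothetical matcher for the single-character regex a.
matchA :: [String] -> [String]
matchA ss = [t | 'a' : t <- ss]
```

On the input ["ab","ab"], stepNub matchA yields ["ab","b"] whereas stepFilter matchA yields ["ab","ab","b","b"], so duplicates survive in the shorter version. The loop still stops: star matchA ["aab"] returns ["aab","ab","b"] once a step adds nothing new.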

Ungolfed Code

import Data.List

match :: String -> String -> Bool
match r s = elem "" $ fst (parseRegex r) [s]

parseRegex :: String -> ([String] -> [String], String)
parseRegex ('_':t) = parseKleene id t
parseRegex (c:t) | c >= 'a' = parseKleene (>>=p) t
  where p (c':t')| c==c' = [t']
        p _ = []
parseRegex ('(':t) =
  let (l, (o:t')) = parseRegex t in
  let (r, (_:t'')) = parseRegex t' in
  parseKleene (if o=='+' then (r.l) else (\ss-> (r ss)++(l ss))) t''

parseKleene :: ([String] -> [String]) -> String -> ([String] -> [String], String)
parseKleene p ('*':t) = parseKleene p' t
  where
    p' ss
      | ss' <- nub$ p ss,
        ss /= ss' = ss ++ (p' ss')
      | otherwise = ss
parseKleene p s = (p,s)

GolfScript, 198 bytes

I was able to beat my Haskell solution by implementing the first algorithm I tried in GolfScript instead of Haskell. I don't think it's interesting enough for a separate answer, so I'll just leave it here. There are likely some golfing opportunities since I learned GolfScript just for this.

This solution is in the form of a block that expects the test string on the top of the stack followed by the regex string.

{[.;]\1+{(.96>{0[\]}{2%0{r\(\r\:s;[@]s(;\}i}{if}:i~\{(.3%}{;\2[\]\}until[.;]\+\}:r~\;{.{(.{.4%{2%{)@\m\)\;m}{)\;{.@.@m 1$|.@={\;}{\o}i}:o~}i}{;)@.@m@@\)\;m|}i}{;(:c;;{,},{(\;c=},{(;}%}i}{;}i}:m~""?}

Try it online!


APL (Dyalog Unicode), 39 bytes (SBCS)

Edit: now works with runs of * even after _

Full program. Prompts stdin for string and then for regex. Returns a list consisting of an empty list (by default, this prints as two spaces) for matches and an empty list (empty line) for non-matches.

(1⌽'$^','\*+' '_'⎕R'*' '()'⊢⍞~'+')⎕S⍬⊢⍞

Try it online! (output made easier to read by converting all output to JSON)

⍞ prompt stdin (for string)

⊢ on that, apply the following:

(…)⎕S⍬ PCRE Search for the following, returning an empty list for each match

~'+' remove all plusses from the following:

⍞ prompt stdin (for regex)

⊢ on that, apply the following:

'\*+' '_'⎕R'*' '()' PCRE Replace runs of * with * and _ with ()

'$^', prepend dollar sign and caret (indicating end and start)

1⌽ rotate the first character ($) to the end
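
The string rewriting described in the steps above can be sketched outside APL as follows. This is only an illustration of the translation; it builds the pattern string and does not run a PCRE search. The name toPcre is hypothetical.

```haskell
import Data.List (group)

-- Translate a "simple regex" into an anchored PCRE-style pattern string.
-- toPcre is a hypothetical name used only for this illustration.
toPcre :: String -> String
toPcre simple = "^" ++ concatMap rewrite (group noPlus) ++ "$"
  where
    -- Concatenation is implicit in PCRE, so the explicit '+' is dropped.
    noPlus = filter (/= '+') simple
    -- group bundles equal adjacent characters, e.g. "a**" -> ["a", "**"].
    rewrite ('*' : _) = "*"               -- collapse a run of stars into one
    rewrite run       = concatMap expand run
    expand '_' = "()"                     -- empty-string regex becomes ()
    expand ch  = [ch]
```

For example, toPcre "(a+_)**" gives "^(a())*$" and toPcre "_***" gives "^()*$", which covers the case of a run of * after _ mentioned in the edit note.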


APL (Dyalog Unicode), 277 bytes (previously 295)

a←819⌶⎕A
E←{⍵{(⍺⊆⍨~⍵),⍺[⍸⍵]}(⍵∊'|+')∧0=+\-⌿'()'∘.=⍵}1↓¯1↓⊢
M←{c←⊃⌽⍵⋄c∊'0',a:0⋄c∊'_*':1⋄r s o←E⍵⋄o='|':∨/∇¨r s⋄∧/∇¨r s}
D←{c←⊃⌽⍵⋄c∊'0_':'0'⋄c=⍺:'_'⋄c∊a:'0'⋄c='*':1⌽∊')('(⍺∇¯1↓⍵)'+'⍵⋄r s o←E⍵⋄o='|':1⌽∊')('(⍺∇r)'|',⍺∇s⋄M r:1⌽∊')(('(⍺∇r)'+'s')|',⍺∇s⋄1⌽∊')('(⍺∇r)'+'s}
{M⊃D/(⌽⍵),⊂⍺}

Try it online!

-18 bytes thanks to @ngn.

This is a proof of concept that "simple regex matching" can be done without any backtracking, which avoids the possible infinite loops caused by _* or r**. It also showcases APL as a general-purpose programming language.

The anonymous function at the last line does the regex matching; use it as (regex) f (input string). The return value is 1 if the match is successful, 0 otherwise.

Concept

Given a simple regex R and the first character c of input string, we can construct (or derive) another simple regex R' that matches exactly the strings s where the original R matches c+s.

$$ \forall R \in \text{simple regex},\ c \in \text{[a-z]}, \\ \exists R' \in \text{simple regex},\ \forall s \in \text{[a-z]*}:\ R' =\sim s \iff R =\sim c+s $$

Combine this with a tester that checks whether a regex matches the empty string (epsilon), and we get a fully working simple regex matcher: given a regex \$ R_0 \$ and string \$ s = c_1 c_2 \cdots c_n \$, sequentially derive \$ R_0, c_1 \rightarrow R_1, c_2 \rightarrow R_2 \cdots \rightarrow R_n \$ and then test if \$ R_n \$ matches epsilon.

My code uses the following algorithm for testing epsilon match (MatchEps) and computing R' from R and c (Derive).

T = True, F = False
0 = null regex (never matches)
_ = "empty string" regex
a = single-char regex
r, s = any (sub-)regex

MatchEps :: regex -> bool
MatchEps 0 = F    # Null regex can't match empty string
MatchEps _ = T    # Empty-string regex trivially matches empty string
MatchEps a = F    # Single-char can't match empty string
MatchEps r* = T   # Kleene matches as zero iteration
MatchEps (r|s) = MatchEps r or MatchEps s
MatchEps (r+s) = MatchEps r and MatchEps s

Derive :: char -> regex -> regex
# No matching string at all
Derive c 0 = 0
# _ can't match any string that starts with c
Derive c _ = 0
# Single-char regex only matches itself followed by empty string
Derive c a = if c == a then _ else 0
# r* matches either _ or (r+r*);
# _ can't start with c, so it must be first `r` of (r+r*) that starts with c
Derive c r* = ([Derive c r]+r*)
# r or s; simply derive from r or derive from s
Derive c (r|s) = ([Derive c r]|[Derive c s])
# r followed by s; it matters if r can match _
Derive c (r+s) =
  # if r matches _, either [r starts with c] or [r matches _ and s starts with c]
  if MatchEps r then (([Derive c r]+s)|[Derive c s])
  # otherwise, r always starts with c
  else ([Derive c r]+s)
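
As a worked instance of the rules above, take the regex (a+b)* and the input string ab. The derivation below applies Derive once per character and then MatchEps to the result; it was traced by hand from the rules and is meant as an illustration, not as output of the golfed code.

$$
\begin{aligned}
R_0 &= (a+b)* \\
R_1 &= \text{Derive}\ a\ R_0 = ((\_+b)+(a+b)*) \\
R_2 &= \text{Derive}\ b\ R_1 = (((0+b)|\_)+(a+b)*) \\
\text{MatchEps}\ R_2 &= (\text{MatchEps}\ (0+b) \lor \text{MatchEps}\ \_) \land \text{MatchEps}\ (a+b)* \\
&= (F \lor T) \land T = T
\end{aligned}
$$

In the second step, the left part (_+b) of R_1 does not match epsilon, so only the rule ([Derive c r]+s) applies; inside it, Derive b (_+b) uses the other branch because _ does match epsilon.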

Ungolfed, with comments

⍝ Unwrap single layer of (...) and extract (r, s, op) from (r|s) or (r+s)
ExtractRS←{⍵{(⍺⊆⍨~⍵),⍺[⍸⍵]}(⍵∊'|+')∧0=+\-⌿'()'∘.=⍵}1↓¯1↓⊢
  ⍝ 1↓¯1↓⊢    Drop the outermost ()
  ⍝ {...}     Pass the result to the function as ⍵...
  ⍝   +\-⌿'()'∘.=⍵    Compute the layers of nested ()s
  ⍝   (⍵∊'|+')∧0=     Locate the operator (`|` or `+`) as bool vector
  ⍝   ⍵{...}          Pass to inner function again ⍵ as ⍺, above as ⍵
  ⍝     ⍺[⍸⍵]     Extract the operator
  ⍝     (⍺⊆⍨~⍵),  Prepend the left and right regexes

⍝ Tests if the given regex matches an empty string (epsilon, eps)
MatchEps←{
    c←⊃⌽⍵                 ⍝ Classify the regex by last char
    c∊'0',819⌶⎕A:0        ⍝ 0(no match) or lowercase: false
    c∊'_*':1              ⍝ _(empty) or Kleene: true
    r s op←ExtractRS ⍵    ⍝ The rest is (r|s) or (r+s); extract it
    op='|': ∨/∇¨r s       ⍝ (r|s): r =~ eps or s =~ eps
    ∧/∇¨r s               ⍝ (r+s): r =~ eps and s =~ eps
}

⍝ Derives regex `R'` from original regex `R` and first char `c`
Derive←{
    c←⊃⌽⍵             ⍝ Classify the regex by last char
    c∊'0_':,'0'       ⍝ 0 or _ doesn't start with any c
    c=⍺:,'_'          ⍝ Single char that matches
    c∊819⌶⎕A:'0'      ⍝ Single char that doesn't match
    c='*': '(',(⍺∇¯1↓⍵),'+',⍵,')'    ⍝ One char from Kleene: (R*)' = (R'+R*)
    r s op←ExtractRS ⍵               ⍝ Extract (r|s) or (r+s)
    op='|': '(',(⍺∇r),'|',(⍺∇s),')'  ⍝ (r|s): one char from either branch
    MatchEps r: '((',(⍺∇r),'+',s,')|',(⍺∇s),')'   ⍝ (r+s) and r =~ eps: ((r'+s)|s')
    '(',(⍺∇r),'+',s,')'                           ⍝ (r+s) but not r =~ eps: (r'+s)
}

⍝ Main function: Fold the string by Derive with initial regex,
⍝                and then test if the result matches eps
f←{MatchEps⊃Derive/(⌽⍵),⊂⍺}
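
For readers less familiar with APL, the same algorithm can be sketched in Haskell over an explicit syntax tree. This is a minimal transcription of the MatchEps and Derive rules, not a port of the golfed code: it assumes the regex is already available as a tree, whereas the APL version works directly on strings. The names Regex, matchEps, derive and matches are illustrative.

```haskell
-- AST for the "simple regex" language used in the pseudocode.
data Regex
  = Null               -- 0: never matches
  | Eps                -- _: matches only the empty string
  | Chr Char           -- a single lowercase letter
  | Star Regex         -- r*
  | Alt Regex Regex    -- (r|s)
  | Cat Regex Regex    -- (r+s)
  deriving (Eq, Show)

-- Does the regex match the empty string?
matchEps :: Regex -> Bool
matchEps Null      = False
matchEps Eps       = True
matchEps (Chr _)   = False
matchEps (Star _)  = True
matchEps (Alt r s) = matchEps r || matchEps s
matchEps (Cat r s) = matchEps r && matchEps s

-- Regex matching exactly the strings t such that the input matches c:t.
derive :: Char -> Regex -> Regex
derive _ Null      = Null
derive _ Eps       = Null
derive c (Chr a)   = if c == a then Eps else Null
derive c (Star r)  = Cat (derive c r) (Star r)
derive c (Alt r s) = Alt (derive c r) (derive c s)
derive c (Cat r s)
  | matchEps r = Alt (Cat (derive c r) s) (derive c s)
  | otherwise  = Cat (derive c r) s

-- Derive once per input character, then test the final regex for epsilon.
matches :: Regex -> String -> Bool
matches r = matchEps . foldl (flip derive) r
```

Both functions recurse only on strictly smaller sub-regexes, so nested stars such as Star (Star (Chr 'a')) or Star Eps cannot cause a loop. For example, matches (Star (Cat (Chr 'a') (Chr 'b'))) "abab" is True and the same regex applied to "aba" is False.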

Final note

This is not an original idea of mine; it is part of a series of exercises in a theorem-proving textbook. I can claim that the algorithm is proven to work (because I did complete the correctness proofs), though I can't make the entire proof public.