Regex/token/rule to match nested curly braces?

After perusing Lenz' "Parsing with Perl 6 Regexes and Grammars" (Apress, 2017), I realized the "regex" machinery (based on backtracking) might actually be a lot more capable than officially admitted, as a regex can call another, and nowhere do I see a prohibition on recursive calls.

Before digging in, a bit about context-free grammars: one way to describe nested braces (and nothing else) is with the grammar:

S -> { S } S | <nothing>

I.e., nested braces are either an opening brace, nested braces, a closing brace, more nested braces; or nothing whatsoever. This translates more or less directly to Raku (there is no empty regex, fake it by making the construction optional):

my regex nb {
   [ '{' <nb> '}' <nb> ]?
}

Lo and behold, this works. Need to fix up to avoid captures, kill backtracking (if it doesn't match on the first try, it won't ever match), and decorate with "anything else" fillers.

my regex nested-braces {
    :ratchet 
     <-[{}]>*
     [ '{' <.nested-braces> '}' <.nested-braces> ]?
     <-[{}]>*
};

This checks out with my test cases.
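For illustration, a check along the following lines exercises the regex. This is only a sketch: the helper name balanced and the input strings are made up here, not the original test cases. The ^ and $ anchors matter, because without them the regex happily matches an empty prefix of any string.

```raku
# The regex from above, unchanged.
my regex nested-braces {
    :ratchet
    <-[{}]>*
    [ '{' <.nested-braces> '}' <.nested-braces> ]?
    <-[{}]>*
}

# Hypothetical helper: True only if the whole string is balanced.
# The ^ and $ anchors force the match to cover the entire input.
sub balanced(Str $s --> Bool) {
    so $s ~~ / ^ <nested-braces> $ /
}

# Made-up inputs: the first four are balanced, the last three are not.
for '', 'a{b}c', '{{}{}}', '{a{b}c}d{e}', '{', '}{', '{{}' -> $s {
    say $s.raku, ' => ', balanced($s);
}
```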

For not-so-adventurous souls, there is the Text::Balanced module for Perl (formerly Perl 5, callable from Raku using Inline::Perl5). Not directly useful to me inside a grammar, unfortunately.


Solution

A way to describe nested braces (and nothing else)

Presuming a rule named &R, I'd likely write the following pattern if I was writing a quick small one-off script:

\{ <&R>* \} 

If I was writing a larger program that should be maintainable, I'd likely be writing a grammar and, using a rule named R, the pattern would be:

'{' ~ '}' <R>*

This latter avoids leaning toothpick syndrome and uses the regex ~ operator.

These will both parse arbitrarily deeply nested paired braces, eg:

say '{{{{}}}}' ~~ token { \{ <&?ROUTINE>* \} } # 「{{{{}}}}」

(&?ROUTINE refers to the routine in which it appears. A regex is a routine, though you can't use <&?ROUTINE> in a regex declared with / ... / syntax.)
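As a rough sketch of the grammar form, the ~ pattern could be wrapped like this. The grammar name Braces and the token name group are placeholders standing in for R, not names from any library. Only balanced input is shown, because a missing closing brace is routed through the grammar's FAILGOAL method and what happens then depends on how that method is defined.

```raku
grammar Braces {
    # Zero or more brace groups side by side.
    token TOP   { <group>* }
    # '{' ~ '}' X means: '{', then X, then the closing '}'.
    token group { '{' ~ '}' <group>* }
}

my $m = Braces.parse('{{}{{}}}');

# Number of top-level groups in the input.
say $m<group>.elems;

# Text of the groups nested directly inside the first top-level group.
say $m<group>[0]<group>.map(~*).join(' ');
```

Because <group> is a named capture, the nesting is available in the parse tree rather than just as matched text.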

regex vs token

kill backtracking

my regex nested-braces {
    :ratchet 

The only difference between patterns declared with regex and token is that token turns ratcheting on by default, while regex leaves backtracking enabled. So declaring a regex and then immediately turning ratcheting on is notably unidiomatic. Instead:

my token nested-braces {
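A minimal sketch of the difference, using two throwaway patterns whose names are made up for this example. The token version of nested-braces is included so it can be compared with the regex plus :ratchet form.

```raku
# Same body, different declarator.
my regex with-backtracking { \w+ 'x' }   # \w+ may give characters back
my token no-backtracking   { \w+ 'x' }   # \w+ keeps everything it matched

# \w+ first swallows 'abcx'. The regex then backs off one character so
# that 'x' can match; the token refuses to, so the match fails.
say so 'abcx' ~~ / ^ <with-backtracking> $ /;
say so 'abcx' ~~ / ^ <no-backtracking> $ /;

# The nested-braces pattern written as a token, with no :ratchet needed.
my token nested-braces {
    <-[{}]>*
    [ '{' <.nested-braces> '}' <.nested-braces> ]?
    <-[{}]>*
}

say so '{a{b}c}d{e}' ~~ / ^ <nested-braces> $ /;
say so '{{}'         ~~ / ^ <nested-braces> $ /;
```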

Backtracking

the "regex" machinery (based on backtracking)

The grammar/regex engine does include backtracking as an optional feature because that's occasionally exactly what one wants.

But the engine is not "based on backtracking", and many grammars/parsers make little or no use of backtracking.

Recursion

a regex can call another, and nowhere do I see a prohibition on recursive calls.

This alone is nothing special for contemporary regex engines.

PCRE has supported recursion since 2000, and named regexes since 2003. Perl's default regex engine has supported both since 2007.

Their support for deeper levels of recursion and more named regexes being stored at once has been increasing over time.

Damian Conway's PPR uses these features of regexes to build non-trivial (but still small) parse trees.

Capabilities

a lot more capable

Raku "regexes" can be viewed as a cleaned up take on the unfolding regex evolution. To the degree this helps someone understand them, great.

But really, it's a whole new deal. For example, they're Turing complete, in a sensible way, and thus able to parse anything.

than officially admitted

Well that's an odd thing to say! Raku's Grammars are frequently touted as one of Raku's most innovative features.

There are three major caveats:

  • Performance: The primary current caveat is that a well written C parser will blow the socks off a well written Raku Grammar based parser.

  • Pay off: It's often not worth the effort it takes to write a fully correct parser for a non-trivial format if there's an existing parser.

  • Left recursion: Raku does not automatically rewrite left-recursive rules, so a rule that calls itself before consuming any input loops forever.
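To make the left recursion caveat concrete, here is a small sketch of the usual manual workaround: replace the left-recursive rule with iteration, then fold the results from left to right. The grammar and its rule names are invented for this example.

```raku
grammar Subtraction {
    # Left-recursive form, shown only as a comment because calling it
    # would recurse without ever consuming input:
    #   token expr { <expr> '-' <num> | <num> }

    token TOP  { <expr> }
    # Iterative rewrite: one or more numbers separated by '-'.
    token expr { <num>+ % '-' }
    token num  { \d+ }
}

my $m = Subtraction.parse('10-3-2');

# [-] reduces left to right, which recovers the left associativity the
# left-recursive rule would have expressed: (10 - 3) - 2.
my $result = [-] $m<expr><num>.map(+*);
say $result;
```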

Using existing parsers

I know there are BibTeX parsers around, but I need to grab the complete entry for further processing, and peek at a few keys meanwhile.

Using a foreign module in Raku can be a bit of a revelation. It is not necessarily like anything you'll have experienced before. Raku's foreign language adaptors can do smart marshaling for you so it can be like you're using native Raku features.

Two of the available foreign language adaptors are already sufficiently polished to be amazing -- the ones for Perl and for C.

I'm pretty sure there's a BibTeX package for Perl that wraps a C BibTeX parser. If you used that you'd hopefully get parsing results all nicely wrapped up into Raku objects as if it was all Raku in the first place, but retaining much of the high performance of the C code.

A Raku BibTeX Grammar?

Perhaps your needs do call for creating and using a small Raku Grammar.

(Maybe you're doing this partly as an exercise to familiarize yourself with Raku, or the regex/grammar aspect of Raku. For that it sounds pretty ideal.)

As soon as you begin to use multiple regexes together -- even just two -- you are closing in on grammar territory. After all, grammars are just an easy-to-use construct for using multiple regexes together.

So if you decide you want to stick with writing parsing code in Raku, expect to write it something like this:

grammar BiBTeX {
  token TOP { ... }
  token ...
  token ...
}
BiBTeX.parse: my-bib-file
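As a very rough sketch of how the nested-brace idea might slot into such a grammar, the following grabs each complete entry and exposes its type and citation key. Everything here is an assumption made for illustration: the grammar name BibEntries, the rule names, and the sample data are invented, and the format is a simplified subset of BibTeX (no @string or @comment handling, no parenthesized entries, no quoted values containing unbalanced braces).

```raku
grammar BibEntries {
    token TOP    { [ <.junk> <entry> ]* <.junk> }
    # Anything between entries that is not the start of an entry.
    token junk   { <-[@]>* }
    token entry  { '@' <type> \s* '{' \s* <key> \s* ',' <body> '}' }
    token type   { \w+ }
    token key    { <-[,{}\s]>+ }
    # Body text may contain arbitrarily nested, balanced braces.
    token body   { [ <-[{}]>+ | <.braced> ]* }
    token braced { '{' <.body> '}' }
}

# Made-up sample input.
my $bib = q:to/END/;
    @book{lenz2017,
      title = {Parsing with {Perl 6} Regexes and Grammars},
      year  = 2017
    }
    @misc{example,
      note = {nested {braces {inside}} a value}
    }
    END

my $m = BibEntries.parse($bib);

for $m<entry> -> $e {
    # ~$e is the complete entry text, available for further processing.
    say "$e<type> $e<key>: {$e.chars} characters";
}
```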

For more details, see the official doc's Grammar tutorial or read Moritz's book.

Tags:

Raku

Grammar