Perl 6 grammar: not sure about some syntax in an example

The {} is an empty code block. It's a procedural (instead of declarative) element of grammars. You could put regular Perl 6 code in there to have it do something.

In this pattern it's doing another job: it provides a sequence point where the grammar engine knows it must do various things before continuing. This includes filling in values for the capture variables (such as $<quote>). The next part of the pattern uses $<quote>, so something has to make sure that value is available by then.

The $<quote> is actually a single element access to the Match object $/. As a hash-like thing, that's really $/<quote> where the thing between the angle brackets is the "key". Perl 6 likes to be a little clever so it lets you leave off the / to get $<quote>. Other match variables are similar shortcuts: $1, for instance, is short for the positional access $/[1].
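As a small sketch of that equivalence (the sample string and the capture name are made up for illustration), the shortcut and the explicit forms read the same values out of $/:

```raku
# Hypothetical input; "quote" is just an illustrative capture name.
my $m = 'x="hi"' ~~ / $<quote>=['"' | "'"] (\w+) /;

# Named capture: $<quote> is shorthand for $/<quote>.
say ~$<quote> eq ~$/<quote>;   # True

# Positional capture: $0 is shorthand for $/[0].
say ~$0 eq ~$/[0];             # True
```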

For your last question it would help to see some sample data you are trying to match. Perl 6 grammars have many features for matching balanced text, which probably makes the task trivial. See, for instance, "Tilde for nesting structures" in the Regexes documentation:

 / '(' ~ ')' <expression> /

Here's a short example in the REPL. There's a string that has some quoted text in it:

$ perl6
To exit type 'exit' or '^D'
> my $s = Q/abcdf "Hello" xyz/
abcdf "Hello" xyz

The ~ in the regex is between the delimiters. The thing that comes after the end delimiter is the stuff you expect to be where the ~ is:

> $s ~~ m/ '"' ~ '"' .+ /
「"Hello"」

You could match the opening thing and capture it (now it's in $0) so you can use the exact same thing as the closing delimiter:

> $s ~~ m/ (<["']>) ~ $0 .+ /
「"Hello"」
 0 => 「"」

For that particular example I think there's a simpler way. Match an escaped quote or anything that's not a quote instead of a look around and any character. That's not quite as mind bending.
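Here is one way that simpler approach might look. It is only a sketch: the sample string is invented, and the closing quote is hard-coded as a double quote rather than taken from a capture.

```raku
# Hypothetical input: the quoted text contains an escaped quote.
my $s = Q/abcdf "Hel\"lo" xyz/;

# Between the delimiters, match either backslash-quote
# or any single character that is not a quote.
my $m = $s ~~ / '"' ~ '"' [ '\\"' || <-["]> ]* /;
say ~$m;   # "Hel\"lo"
```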


TL;DR @briandfoy has provided an easy to digest answer. But here be dragons that he didn't mention. And pretty butterflies too. This answer goes deep.

Question 1: what is this {} in the token doing?

It's a code block. [1][2][3][4]

It's an empty one and has been inserted purely to force the $<quote> in quotebody($<quote>) to evaluate to the value captured by the <quote> at the start of the regex.

The reason why $<quote> does not contain the right value without insertion of a code block is a Rakudo Perl 6 compiler limitation or bug related to "publication of match variables".

"Publication" of match variables by Rakudo

Moritz Lenz states in a Rakudo bug report that "the regex engine doesn't publish match variables unless it is deemed necessary".

By "regex engine" he means the regex/grammar engine in NQP, part of the Rakudo Perl 6 compiler. [3]

By "match variables", he means the variables that store captures of match results:

  • the current match variable $/;

  • the numbered sub-match variables $0, $1, etc.;

  • named sub-match variables of the form $<foo>.

By "publish" he means that the regex/grammar engine does what it takes so that any mentions of any variables in a regex (a token is also a regex) evaluate to the values they're supposed to have when they're supposed to have them. Within a given regex, match variables are supposed to contain a Match object corresponding to what has been captured for them at any given stage in processing of that regex, or Nil if nothing has been captured.

By "deemed necessary" he means that the regex/grammar engine makes a conservative call about whether it's worth doing the publication work after each step in the matching process. By "conservative" I mean that the engine often avoids doing publication, because it slows things down and is usually unnecessary. Unfortunately it's sometimes too optimistic about when publication can safely be skipped. Hence the need for programmers to sometimes intervene by explicitly inserting a code block to force publication of match variables (and other techniques for other variables [5]). It's possible that the regex/grammar engine will improve in this regard over time, reducing the scenarios in which manual intervention is necessary. If you wish to help progress this, please create test cases that matter to you for existing related bugs. [5]

"Publication" of $<quote>'s value

The named capture $<quote> is the case in point here.

As far as I can tell, all sub-match variables correctly refer to their captured value when written directly into the regex without a surrounding construct. This works:

my regex quote { <['"]> }
say so '"aa"' ~~ / <quote> aa $<quote> /; # True

I think [6] $<quote> gets the right value because it is parsed as a regex slang construct. [4]

In contrast, if the {} were removed from

token string { <quote> {} <quotebody($<quote>)> $<quote> }

then the $<quote> in quotebody($<quote>) would not contain the value captured by the opening <quote>.

I think this is because the $<quote> in this case is parsed as a main slang construct.

Question 2a: escaped($quote) inside <> would be a regex function, right? And it takes $quote as an argument

That's a good first approximation.

More specifically, regex atoms of the form <foo(...)> are calls of the method foo.

All regexes -- whether declared with token, regex, rule, /.../ or any other form -- are methods. But methods declared with method are not regexes:

say Method ~~ Regex; # False
say WHAT token { . } # (Regex)
say Regex ~~ Method; # True
say / . / ~~ Method; # True

When the <escaped($quote)> regex atom is encountered, the regex/grammar engine doesn't know or care if escaped is a regex or not, nor about the details of method dispatch inside a regex or grammar. It just invokes method dispatch, with the invocant set to the Match object that's being constructed by the enclosing regex.

The call yields control to whatever ends up running the method. It typically turns out that the regex/grammar engine is just recursively calling back into itself because typically it's a matter of one regex calling another. But it isn't necessarily so.
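A minimal sketch of such a call, using an invented grammar (Letters and letter are hypothetical names, not part of the grammar in the question): the argument in the parentheses is passed to the token exactly as it would be to any method.

```raku
grammar Letters {
    # Each <letter(...)> atom is a method call with one positional argument.
    token TOP { <letter('a')> <letter('b')> }

    # $which is an ordinary parameter; inside the regex body it is
    # interpolated and matched as a literal string.
    token letter($which) { $which }
}

say so Letters.parse('ab');   # True
say so Letters.parse('ba');   # False
```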

and returns another regex

No, a regex atom of the form <escaped($quote)> does not return another regex.

Instead it calls a method that will/should return a Match object.

If the method called was a regex, P6 will make sure the regex generates and populates the Match object automatically.

If the method called was not a regex but instead just an ordinary method, then the method's code should have manually created and returned a Match object. Moritz shows an example in his answer to the SO question Can I change the Perl 6 slang inside a method?.

The Match object is returned to the "regex/grammar engine" that drives regex matching / grammar parsing. [3]

The engine then decides what to do next according to the result:

  • If the match was successful, the engine updates the overall match object corresponding to the calling regex. The updating may include saving the returned Match object as a sub-match capture of the calling regex. This is how a match/parse tree gets built.

  • If the match was unsuccessful, the engine may backtrack, undoing previous updates; thus the parse tree may dynamically grow and shrink as matching progresses.
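To illustrate the successful case, here is a small sketch with an invented grammar (KeyValue, word and num are hypothetical names). Each successful subrule call ends up as a named sub-match in the Match object of the calling regex:

```raku
grammar KeyValue {
    token TOP  { <word> '=' <num> }
    token word { \w+ }
    token num  { \d+ }
}

my $tree = KeyValue.parse('answer=42');
say ~$tree<word>;   # answer
say ~$tree<num>;    # 42

# A failed parse yields no tree at all.
say KeyValue.parse('answer=forty-two').defined;   # False
```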

Question 2b: If I want to indicate "char that is not before quote", should I use . <!before $quote> instead of <!before $quote> . ??

Yes.

But that's not what's needed for the quotebody regex, if that's what you're talking about.

While on the latter topic, in @briandfoy's answer he suggests using a "Match ... anything that's not a quote" construct rather than doing a negative look ahead (<!before $quote>). His point is that matching "not a quote" is much easier to understand than "are we not before a quote? then match any character".

However, it is by no means straightforward to do this when the quote is a variable whose value is set to the capture of the opening quote. This complexity is due to bugs in Rakudo. I've worked out what I think is the simplest way around them, but I think it's likely best to just stick with <!before $quote> . unless/until these long-standing Rakudo bugs are fixed. [5]

token escaped($quote) { '\\' ( $quote | '\\' ) } # I think this is a function;

It's a token, which is a Regex, which is a Method, which is a Routine:

say token { . } ~~ Regex;   # True
say Regex       ~~ Method;  # True
say Method      ~~ Routine; # True

The code inside the body (the { ... } bit) of a regex is written in the P6 regex "slang", whereas the code inside the body of a method routine is written in the main P6 "slang". [4] In token { . }, for instance, the body is the lone ., which is a regex atom that matches a single character.

Using ~

The regex tilde (~) operator is specifically designed for the sort of parsing in the example this question is about. It reads better inasmuch as it's instantly recognizable and keeps the opening and closing quotes together. Much more importantly it can provide a human intelligible error message in the event of failure because it can say what closing delimiter(s) it's looking for.

But there's a key wrinkle you must consider if you insert a code block in a regex (with or without code in it) right next to the regex ~ operator (on either side of it). You will need to group the code block unless you specifically want the tilde to treat the code block as its own atom. For example:

token foo { <quote> ~ $<quote> {} <quotebody($<quote>)> }

will match a pair of <quote>s with nothing between them. (And then try to match <quotebody...>.)

In contrast, here's a way to duplicate the matching behavior of the string token in the String::Simple::Grammar grammar:

token string { <quote> ~ $<quote> [ {} <quotebody($<quote>)> ] }
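On the error-message point made earlier in this section: as I understand it, when the closing delimiter of a ~ construct can't be found, the engine calls a FAILGOAL method, which a grammar can define itself. The sketch below assumes that behavior; the grammar name Bracketed and the message wording are invented for illustration.

```raku
grammar Bracketed {
    token TOP { '[' ~ ']' \w+ }

    # Called when the goal (the closing delimiter) cannot be matched.
    method FAILGOAL($goal) {
        die "Cannot find $goal near position {self.pos}";
    }
}

say so Bracketed.parse('[good]');   # True

# The closing bracket is missing, so FAILGOAL runs and dies.
try Bracketed.parse('[bad');
my $msg = $! ?? $!.message !! '';
say $msg;
```

The point is that the message can name the delimiter that was being looked for, which a plain failed match cannot do.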

Footnotes

[1] In 2002 Larry Wall wrote "It needs to be just as easy for a regex to call Perl code as it is for Perl code to call a regex.". Computer scientists note that you can't have procedural code in the middle of a traditional regular expression. But Perls long ago led the shift to non-traditional regexes and P6 has arrived at the logical conclusion -- a simple {...} is all it takes to insert arbitrary procedural code in the middle of a regex. The language design and regex/grammar engine implementation [3] ensure that traditional style purely declarative regions within a regex are recognized, so that formal regular expression theory and optimizations can be applied to them, but nevertheless arbitrary regular procedural code can also be inserted. Simple uses include matching logic and debugging. But the sky's the limit.

[2] The first procedural element of a regex, if any, terminates what's called the "declarative prefix" of the regex. A common reason for inserting an empty code block ({}) is to deliberately terminate a regex's declarative prefix when that provides the desired matching semantics for a given longest alternation in a regex. (But that isn't the reason for its inclusion in the token you're trying to understand.)
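A short sketch of that effect, with an invented input string: in a | alternation the branch with the longest declarative prefix is tried first, and an empty code block cuts a branch's prefix short.

```raku
# Both branches are fully declarative; "a.*" has the longer match, so it wins.
my $plain = 'abc' ~~ / ab | a.* /;
say ~$plain;   # abc

# The {} ends the second branch's declarative prefix after "a",
# so "ab" now has the longer prefix and is tried first.
my $cut = 'abc' ~~ / ab | a {} .* /;
say ~$cut;     # ab
```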

[3] Loosely speaking, the regex / grammar engine in NQP is to P6 what PCRE is to P5.

A key difference is that the regex language, along with its associated regex/grammar engine, and the main language it cooperates with, which in the case of Rakudo is Perl 6, are co-equals control-wise. This is an implementation of Larry Wall's original 2002 vision for integration between regexes and "rich languages". Each language/run-time can call into the other and communicate via high level FFIs. So they can appear to be, can behave as, and indeed are, a single system of cooperating languages and cooperating run-times.

(The P6 design is such that all languages can be explicitly designed, or be retro-fitted, to cooperate in a "rich" manner via two complementary P6 FFIs: the metamodel FFI 6model and/or the C calling convention FFI NativeCall.)

[4] The P6 language is actually a collection of sub-languages -- aka slangs -- that are used together. When you are reading or writing P6 code you are reading or writing source code that starts out in one slang but has sections written in others. The first line in a file uses the main slang. Let's say that's analogous to English. Regexes are written in another slang; let's say that's like Spanish. So in the case of the grammar String::Simple::Grammar, the code begins in English (the use v6; statement), then recurses into Spanish (after the { of rule TOP {), i.e. the ^ <string> $ bit, and then returns back out into English (the comment starting # Note ...). Then it recurses back into Spanish for <quote> {} <quotebody($<quote>)> $<quote> and in the middle of that Spanish, at the {} code block, it recurses into another level of English again. So that's English within Spanish within English. Of course, the code block is empty, so it's like writing/reading nothing in English and then immediately dropping back into Spanish, but it's important to understand that this recursive stacking of languages/run-times is how P6 works, both as a single overall language/run-time and when cooperating with other non-P6 languages/run-times.

[5] I encountered several bugs, listed at the end of this footnote, in the process of applying two potential improvements, both mentioned in briandfoy's answer and in this one. The two "improvements" are use of the ~ construct, and use of a "not a quote" construct instead of the <!before foo> . pattern. The final result, followed by the pertinent bugs:

grammar String::Simple::Grammar {
  rule TOP {^ <string> $}
  token string {
    :my $*not-quote;
    <quote> ~ $<quote>
    [
      { $*not-quote = "<-[$<quote>]>" }
      <quotebody($<quote>)>
    ]
  }
  token quote { '"' | "'" }
  token quotebody($quote) { ( <escaped($quote)> | <$*not-quote> )* }
  token escaped($quote) { '\\' ( $quote | '\\' ) }
}

If anyone knows of a simpler way to do this, I'd love to hear about it in a comment below.

I ended up searching the RT bugs database for all regex bugs. I know SO isn't a bug database, but I think it's reasonable for me to note the following ones. As I understand it, the first two directly interact with the issue of publication of match variables.

  • "the < > regex call syntax looks up lexicals only in the parent scope of the regex it is used in, and not in the scope of the regex itself." rt #127872

  • Backtracking woes as they relate to passing arguments in regex calls

  • It looks like there are lots of nasty threading bugs. Most boil down to the fact that several regex features use EVAL behind the scenes and EVAL is not yet thread-safe. Fortunately the official doc mentions these.

  • Can't do recursive grammars due to .parse setting $/.

[6] This question and my answer have pushed me to the outer limits of my understanding of an ambitious and complex aspect of P6. I plan to soon gain greater insight into the precise interactions between nqp and full P6, and the hand-offs between their regex slangs and main slangs, as discussed in footnotes above. (My hopes currently largely rest on having just bought commaide.) I'll update this answer if/when I have some results.