Using 'after' as lookbehind in a grammar in Raku

When we parse a string using a grammar, the matching is anchored to the start of the string. Parsing the input with parse requires us to consume all of the string. There is also a subparse, which allows us to not consume all of the input, but this is still anchored to the start of the string.
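To make the difference between parse and subparse concrete, here is a minimal sketch. The grammar name Hero and the input strings are made up for illustration; they are not from the question.

```raku
# Hypothetical one-token grammar, used only for illustration.
grammar Hero {
    token TOP { LUKE }
}

# parse must consume the whole input.
say so Hero.parse("LUKE");                # True
say so Hero.parse("LUKE SKYWALKER");      # False

# subparse may stop early, but still has to start at position 0.
say Hero.subparse("LUKE SKYWALKER").Str;  # LUKE
say so Hero.subparse("\n\nLUKE");         # False
```

The last line fails because nothing in the grammar consumes the two leading newlines.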

By contrast, a regex like /<?after \n\n>LUKE/ will scan through the string, trying to match the pattern at each position in the string, until it finds a position at which it matches (or gets to the end of the string and gives up). This is why it works. Note, however, that if your goal is to not capture the \n\n, then you could instead have written the regex as /\n\n <( LUKE/, where <( indicates where to start capturing. At least on the current Rakudo compiler implementation, this way is more efficient.
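As a small sketch of the two spellings side by side (the variable names are my own), both leave LUKE as the matched text, starting at offset 2:

```raku
my $text = "\n\nLUKE";

# Lookbehind: assert that \n\n precedes the match without consuming it.
my $lookbehind = $text ~~ / <?after \n\n> LUKE /;

# Capture marker: consume \n\n, then start the reported match at <( .
my $marker = $text ~~ / \n\n <( LUKE /;

say $lookbehind.Str;   # LUKE
say $marker.Str;       # LUKE
say $lookbehind.from;  # 2
say $marker.from;      # 2
```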

It's not easy to suggest how to write the grammar without a little more context (I'm guessing this is extracted from a larger problem). You could, for example, consume whitespace at the start of the grammar:

grammar MyGrammar {

    token TOP {
        \s+ <character>
    }

    token character {
        <?after \n\n>LUKE
    }
}

say MyGrammar.subparse("\n\nLUKE");

Or consume the \n\n in character but exclude it from the match with <(, as mentioned earlier.

`<?after ...>` does not advance the match cursor

Of crucial import here is that <?after \n\n> is a "zero width" assertion.

It matches if the match cursor is sitting to the immediate right of "\n\n" in the string being matched, but it doesn't advance the match cursor.
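A quick way to see the "zero width" part, using a made-up two-character string rather than the input from the question:

```raku
# The assertion alone succeeds at offset 1 (just after the "a"),
# but the resulting match is empty: from and to are the same.
my $zero = "ab" ~~ / <?after a> /;
say $zero.from;       # 1
say $zero.to;         # 1
say $zero.Str.chars;  # 0

# Because it consumes nothing, it can sit between two atoms
# without changing what they match.
my $both = "ab" ~~ / a <?after a> b /;
say $both.Str;        # ab
```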

Why the `~~ / ... /` version matches

The regex/grammar engine is automatically advancing the match cursor for you.

A plain regex-style match works like traditional regexes. In particular, it is supposed to match anywhere in the string being matched, unless you explicitly add anchors such as ^ (start of string) and/or $ (end of string).

More explicitly, the match engine will start by trying to match at the first character position of a string being matched. Then, if that fails, it'll automatically move forward one character in the string, and then try again to match from the start of the regex pattern.
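The scanning can be imitated by hand with the :pos adverb, which anchors a match attempt at a given offset. This is only a sketch of the idea, not how the engine is actually implemented, and the variable names are mine:

```raku
my $text = "\n\nLUKE";
my $found;

# Try the pattern at offset 0, then 1, then 2, ... until one attempt succeeds.
for 0 .. $text.chars -> $pos {
    my $m = $text ~~ m:pos($pos)/ <?after \n\n> LUKE /;
    if $m {
        $found = $pos;
        last;
    }
}

say $found;  # 2
```

Offsets 0 and 1 fail because there are not yet two newlines behind the cursor; offset 2 is the first place where the lookbehind holds.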

So all of these will also match and give the same result:

"\n\nLUKE" ~~ /LUKE/;                     # ｢LUKE｣
"\n\nLUKE" ~~ /LUKE $/;                   # ｢LUKE｣
"LUKE"     ~~ /^ LUKE $/;                 # ｢LUKE｣
"\n\nLUKE" ~~ / <?after \n\n>LUKE $/;     # ｢LUKE｣

Why the grammar version doesn't match

A grammar is expected to match starting at the start of the input string. Otherwise it fails.

More explicitly, .parse has implicit ^ and $ anchors at the start and end of a parse, and .subparse has an implicit ^ at the start.

If the match cursor fails to progress past the first character then the parse fails. Your grammar doesn't progress the match cursor past the first character, so it fails.

(The <?after \n\n> not only would fail to advance the cursor if it matched, it never even matches in the first place -- because at the start of the string the match cursor is only after nothing. If you had written <?after ''> instead, then that would always succeed, but would still not advance the cursor, so the grammar would still fail if that's the only change you made.)
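For reference, here is what I assume the failing grammar looked like. The original is not shown here, so the name Unanchored and the exact layout are a reconstruction:

```raku
# Assumed reconstruction of the grammar from the question.
grammar Unanchored {
    token TOP {
        <character>
    }

    token character {
        <?after \n\n> LUKE
    }
}

# Both calls start at offset 0, where nothing precedes the cursor,
# so the lookbehind fails and there is no scanning to rescue it.
say so Unanchored.subparse("\n\nLUKE");  # False
say so Unanchored.parse("\n\nLUKE");     # False
```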

The current answers are excellent, but let me be a bit more verbose in explaining the origin of the misunderstanding. The main point is that you're comparing a token that is part of a grammar with a standalone regex. They use the same language, regular expressions, but they are not the same thing. You can use a regex to match, substitute and extract information; the objective of a token is purely to extract information: from a string with a regular structure, I want one part and just that part. I assume you're interested in the LUKE part, and that you are using <?after to express something like "No, this is not what I'm interested in", or "Skip this, get me only the goods". Jonathan has already mentioned one way, probably the best, to do so:

grammar MyGrammar {

    token TOP {
        <character>
    }

    token character {
         \n \n <( LUKE
    }
}

say MyGrammar.subparse("\n\nLUKE");

This will not only match, but the character token will also capture only LUKE:

｢

LUKE｣
 character => ｢LUKE｣

The leading newlines are consumed but left out of the capture. However, grammars are meant for extracting rather than just matching. So you probably want the separators to be part of the grammar too, since it is not worthwhile to repeat them in every token. Besides, grammars are generally intended to be used top-down. So this will do:

grammar MyGrammar {

    token TOP {
        <separator><character>
    }

    token separator { \n \n }
    token character { <[A..Z]>+  }
}

say MyGrammar.parse("\n\nLUKE");

The character token is now more general (although maybe it could allow some whitespace, I don't know). Then again, maybe you're not interested in the separator. Just use a dot to avoid capturing it. Not being interested in it does not mean you don't have to parse it, and grammars give you a way of doing that:

grammar MyGrammar {

    token TOP {
        <.separator><character>
    }

    token separator { \n \n }
    token character { <[A..Z]>+  }
}

say MyGrammar.parse("\n\nLUKE");

This one gives the same result as the <( version:

｢

LUKE｣
 character => ｢LUKE｣
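To see what the dot changes, one can compare the named captures of the two variants. The grammar names below are made up so that both versions can live in one script:

```raku
# Same tokens as above; only the call to separator in TOP differs.
grammar WithCapture {
    token TOP { <separator><character> }
    token separator { \n \n }
    token character { <[A..Z]>+ }
}

grammar WithoutCapture {
    token TOP { <.separator><character> }
    token separator { \n \n }
    token character { <[A..Z]>+ }
}

my $with    = WithCapture.parse("\n\nLUKE");
my $without = WithoutCapture.parse("\n\nLUKE");

say $with.hash.keys.sort;     # (character separator)
say $without.hash.keys.sort;  # (character)
say $without<character>.Str;  # LUKE
```

In both cases the separator is parsed; the dot only keeps it out of the match tree.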

At the end of the day, grammars and regexes have different use cases, and thus different solutions for the same objective. Thinking about them in the proper way gives you a hint on how to structure them.

Tags:

Regex

Raku

Grammar
