Raku Grammar: Use named regex without consuming matching string

The .*? works but is inefficient.
It has to do a lot of backtracking.

To improve it you could use \N* which matches everything except a newline.

grammar Grammar::Entries {
    rule TOP { <logentries>+ }

    token logentries { <loglevel> <logentry> }
    token loglevel { 'DEBUG' | 'WARN' | 'INFO' | 'ERROR' }
    token logentry { \N* \n }
}

Then you would have to add the newline matching back in.

    token logentry {
      <logline>* %% \n
    }
    token logline { <!before \w> \N* }

This would work, but it still isn't great.


I would structure the grammar more like the thing you are trying to parse.

grammar Grammar::Entries {
    token TOP { <logentries>+ }

    token logentries { <loglevel> <logentry> }
    token loglevel { 'DEBUG' | 'WARN' | 'INFO' | 'ERROR' }
    token logentry { <logline>* }
    token logline { '    ' <(\N+)> \n? }
}

Since I noticed that the log lines always start with 4 spaces, we can use that to make sure that only lines that start with that are counted as a logline. This also deals with the remaining data on the line with the log level.

I really don't like that you have a token with a plural name that only matches one thing.
So I would rename logentries to logentry. Of course that means the existing logentry needs a new name as well; here I call it logdata.

grammar Grammar::Entries {
    token TOP { <logentry>+ }

    token logentry { <loglevel> <logdata> }
    token loglevel { 'DEBUG' | 'WARN' | 'INFO' | 'ERROR' }
    token logdata { <logline>* }
    token logline { '    ' <(\N+)> \n? }
}

I also don't like the redundant log prefix on every token name.

grammar Grammar::Entries {
    token TOP { <entry>+ }

    token entry { <level> <data> }
    token level { 'DEBUG' | 'WARN' | 'INFO' | 'ERROR' }
    token data { <line>* }
    token line { '    ' <(\N+)> \n? }
}

So what this says is that a Grammar::Entries consists of at least one entry.
An entry starts with a level, and ends with some data.
data consists of any number of lines.
A line starts with four spaces, followed by at least one non-newline character, and may end with a newline.
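
As a rough usage sketch of that description, the following parses a small sample and walks the resulting match tree. The sample input in `$log` is an assumption (a level, four spaces, a message, then indented continuation lines), not the log from the question, so adjust it to the real format.

```raku
grammar Grammar::Entries {
    token TOP   { <entry>+ }

    token entry { <level> <data> }
    token level { 'DEBUG' | 'WARN' | 'INFO' | 'ERROR' }
    token data  { <line>* }
    token line  { '    ' <(\N+)> \n? }
}

# Hypothetical sample input, made up for this sketch.
my $log = "DEBUG    starting up\n    loading config\nERROR    disk full\n";

my $match = Grammar::Entries.parse($log);

# Each entry holds one level and a list of line captures.
my @levels = $match<entry>.map(*<level>.Str);
my @first  = $match<entry>[0]<data><line>.map(*.Str);

for $match<entry>.list -> $entry {
    say $entry<level>.Str;
    say "  $_" for $entry<data><line>.map(*.Str);
}
```

Because of the `<(` and `)>` markers, each line capture should stringify to just the text after the four spaces, without the trailing newline.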


The point I'm trying to make is to structure the grammar the same way that the data is structured.

You could even go and add the structure for pulling out the information so that you don't have to do that as a second step.
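
One way to do that is with an actions class passed to `.parse`. This is only a sketch: the class name `Entries::Actions`, the shape of the result (one hash per entry with `level` and `lines` keys), and the sample input are all assumptions made for illustration.

```raku
grammar Grammar::Entries {
    token TOP   { <entry>+ }

    token entry { <level> <data> }
    token level { 'DEBUG' | 'WARN' | 'INFO' | 'ERROR' }
    token data  { <line>* }
    token line  { '    ' <(\N+)> \n? }
}

# Hypothetical actions class: each method runs when the token of the
# same name matches, and attaches a value to that match with `make`.
class Entries::Actions {
    method TOP($/)   { make $<entry>.map(*.made).List }
    method entry($/) { make %( level => $<level>.Str, lines => $<data>.made ) }
    method data($/)  { make $<line>.map(*.Str).List }
}

# Hypothetical sample input, made up for this sketch.
my $log = "WARN    low memory\n    free: 12 MB\nINFO    done\n";

my $entries = Grammar::Entries.parse($log, actions => Entries::Actions).made;

for $entries.list -> $entry {
    say $entry<level> ~ ': ' ~ $entry<lines>.join(' / ');
}
```

The result in `$entries` is then plain data (a list of hashes), so later code does not need to know anything about the match tree.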


"as far as I know <.loglevel> means non-capturing."

It means non-capturing (don't hold onto the match so code can access it later), not non-matching.

What you want to do is match without advancing the match position, a so-called "zero-width assertion". I haven't tested this but expect it to work (famous last words):

grammar Grammar::Entries {
    rule TOP { <logentries>+ }

    token logentries { <loglevel> <logentry> }
    token loglevel { 'DEBUG' | 'WARN' | 'INFO ' | 'ERROR' }
    token logentry { .*? <.finish> }
    token finish { <?loglevel> || $ }     # <-- the change
}
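
To see the difference between consuming a match and merely asserting it, here is a small standalone sketch. The lexical `level` regex and the sample string are made up for illustration and are independent of the grammar; the point is only to compare where each match ends.

```raku
# Hypothetical standalone regex, mirroring the loglevel token.
my regex level { 'DEBUG' | 'WARN' | 'INFO' | 'ERROR' }

my $text = 'abc WARN def';

# <level> matches AND consumes: the match extends past 'WARN'.
my $consumed = $text ~~ / .*? <level> /;

# <?level> only asserts that a level follows: the match stops before it.
my $peeked = $text ~~ / .*? <?level> /;

say $consumed.to;   # end position when the level is consumed
say $peeked.to;     # end position when the level is only asserted
```

The second match should end where `WARN` begins, leaving the level available for whatever is matched next, which is what the `finish` token relies on.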
