Alternation in regexes seems to be terribly slow in big files

Inspection of the example file reveals that the strings only occur as a whole line, starting with a tab and a space. From your responses I further gathered that you're really only interested in counts. If that is the case, then I would suggest something like this solution:

my %targets = "\t Low", "Low", "\t Medium", "Medium", "\t High", "High";
my %vulnerabilities is Bag = $g.lines.map: {
    %targets{$_} // Empty
}
dd %vulnerabilities;  # ("Low"=>2877,"Medium"=>54).Bag

This runs in about .25 seconds on my machine.

It always pays to look at the problem domain thoroughly!


This can be simplified a little bit. You use \s+ before and after, but is this necessary? I think you need just to assure word boundary or just one whitespace, thus, you can use

\s("Low"||"Medium"||"High")\s

or you can use \b instead of \s.

Second step is not to use capturing group, use non-capturing grous instead, because regex engine wastes time and memory for "remembering" groups, so you could try with:

\s(?:"Low"||"Medium"||"High")\s

TL;DR I've compared solutions on a recent rakudo, using your sample data. The ugly brute-force solution I present here is about twice as fast as the delightfully elegant solution Liz has presented. You could probably improve times another order of magnitude or more by breaking your data up and parallel processing it. I also discuss other options if that's not enough.

Alternations seems like a red herring

When I eliminated the alternation (leaving just "Low") and ran the code on a recent rakudo, the time taken was about the same. So I think that's a red herring and have not studied that aspect further.

Parallel processing looks promising

It's clear from your data that you could break it up, splitting at some arbitrary line, and then pattern match each piece in parallel, and then combine results.

That could net you a substantial win, depending on various factors related to your system and the data you process.

But I haven't explored this option.

The fastest results I've seen

The fastest results I've seen are with this code:

my %counts;
$g ~~ m:g / "\t " [ 'Low' || 'Medium' || 'High' ] \n { %counts{$/}++ } /;
say %counts.map: { .key.trim, .value }

This displays:

((Low 2877) (Medium 54))

This approach incorporates similar changes to those Michał Turczyn discussed, but pushed harder:

  • I've thrown away all capturing, not only not bothering to capture the 'Low' or whatever, but also throwing away all results of the match.

  • I've replaced the \s+ patterns with concrete characters rather than character classes. I've done so on the basis my casual tests with a recent rakudo suggested that's a bit faster.

Going beyond raku's regexes

Raku is designed for full Unicode generality. And its regex engine is extremely powerful. But it looks like your data is just ASCII and your pattern is a typical very simple regex. So you're using a sledgehammer to crack a nut. This shouldn't really matter -- the sledgehammer is supposed to be just fine as a nutcracker too -- but raku's regex engine remains very poorly optimized thus far.

Perhaps this nut is just a simple example and you're just curious about pushing raku's built in regex capabilities to their maximum current performance.

But if not, and you need yet more speed, and the speedups from this or other better solutions in raku, coupled with parallel processing, aren't enough to get you where you need to go, it's worth considering either not using raku or using it with another tool.

One idiomatic way to use raku with another tool is to use an Inline, with the obvious one in this case being Inline::Perl5. Using that you can try perl's fast default built in regex engine or even use its regex plugin capability to plug in a really fast regex engine.

And, given the simplicity of the pattern you're matching, you could even eschew regexes altogether by writing a quick bit of glue to some low-level raw text searching tool (perhaps saving character offsets and then generating corresponding raku match objects from the results).

Tags:

Regex

Raku