Efficient String/Pattern Matching in C++ (suffixarray, trie, suffixtree?)

Given your comment that the patterns do not need to be updated at runtime, I'm not sure you need a runtime structure at all.

I'd recommend using re2c or ragel to compile the patterns to code that will do the pattern matching.


You might want to look at flex. From the manual:

flex is a tool for generating scanners. A scanner is a program which recognizes lexical patterns in text. The flex program reads the given input files, or its standard input if no file names are given, for a description of a scanner to generate. The description is in the form of pairs of regular expressions and C code, called rules. flex generates as output a C source file, lex.yy.c by default, which defines a routine yylex(). This file can be compiled and linked with the flex runtime library to produce an executable. When the executable is run, it analyzes its input for occurrences of the regular expressions. Whenever it finds one, it executes the corresponding C code.

Also this:

The main design goal of flex is that it generate high-performance scanners. It has been optimized for dealing well with large sets of rules.

For example, this scanner matches the three patterns in your post:

%%
"WHAT IS XYZ?"      puts("matched WHAT-IS-XYZ");
"WHAT IS ".*"?"     puts("matched WHAT-IS");
"HOW MUCH ".*"?"    puts("matched HOW-MUCH");

Flex works by generating a deterministic finite automaton (DFA). A DFA looks at each input character exactly once. There is no backtracking, even when matching wildcards. Run time is O(N), where N is the number of input characters. (More patterns will generate larger DFA tables, which will cause more cache misses, so there is some penalty for more patterns. But that is true of any matching system I can think of.)
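To make the "each character exactly once" point concrete, here is a minimal C++ sketch of a table-driven DFA for just one of the patterns, "WHAT IS ".*"?", applied to a whole line. The class name WhatIsDfa and the state numbering are made up for illustration; this is not the table layout flex actually emits, and a real scanner also has to find where each match ends inside a longer input.

```cpp
#include <array>
#include <cstddef>
#include <string>

// Hypothetical hand-built DFA for one pattern, "WHAT IS ".*"?", applied to a
// whole line. This is an illustration, not the table layout flex emits.
class WhatIsDfa {
public:
    WhatIsDfa() {
        for (auto& row : table_) row.fill(kDead);
        // States 0..7: one state per character of the literal prefix.
        const std::string prefix = "WHAT IS ";
        for (std::size_t i = 0; i < prefix.size(); ++i) {
            table_[i][static_cast<unsigned char>(prefix[i])] = static_cast<int>(i) + 1;
        }
        // kBody: inside the wildcard. kAccept: inside the wildcard, last char was '?'.
        for (int s = kBody; s <= kAccept; ++s) {
            table_[s].fill(kBody);
            table_[s]['?'] = kAccept;
            table_[s]['\n'] = kDead;  // like flex's '.', the wildcard stops at newline
        }
    }

    // One table lookup per input character; no character is revisited.
    bool matches(const std::string& text) const {
        int state = 0;
        for (unsigned char c : text) {
            state = table_[state][c];
            if (state == kDead) return false;
        }
        return state == kAccept;
    }

private:
    enum : int { kBody = 8, kAccept = 9, kDead = 10, kStateCount = 11 };
    std::array<std::array<int, 256>, kStateCount> table_;
};
```

Calling WhatIsDfa().matches("WHAT IS XYZ?") walks the table once from left to right. The wildcard costs nothing extra because "still inside .*" and "just saw a ?" are simply two states.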

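However, you will have to list your patterns in the proper order to match them correctly. When several rules match, flex takes the longest match, and among rules that match the same length it takes the one listed first. Flex may tell you if there's a problem. For example, if you reverse the order of the WHAT-IS-XYZ and WHAT-IS patterns in the above scanner, flex will tell you: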

:; flex matcher.l
matcher.l:3: warning, rule cannot be matched
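The reason for that warning can be sketched in C++. The code below emulates only flex's selection rule (longest match wins, the earlier rule wins ties) using std::regex, which is a backtracking engine and not how flex works internally. Rule, pick_rule and example_rules are hypothetical names, and the sketch relies on greedy ECMAScript matching giving the longest match for these particular patterns, which does not hold for regular expressions in general.

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

struct Rule {
    std::string name;    // hypothetical label for the rule's action
    std::regex pattern;  // std::regex translation of the flex pattern
};

// Returns the index of the rule chosen for `input`, or -1 if none matches.
// Matches are anchored at the start of the input, as in a scanner.
int pick_rule(const std::vector<Rule>& rules, const std::string& input) {
    int best = -1;
    std::size_t best_len = 0;
    for (std::size_t i = 0; i < rules.size(); ++i) {
        std::smatch m;
        if (std::regex_search(input, m, rules[i].pattern,
                              std::regex_constants::match_continuous)) {
            const std::size_t len = static_cast<std::size_t>(m.length(0));
            // Strict '>' keeps the earlier rule when two matches tie on length.
            if (best == -1 || len > best_len) {
                best = static_cast<int>(i);
                best_len = len;
            }
        }
    }
    return best;
}

// The three rules from the scanner above, in the order that works.
std::vector<Rule> example_rules() {
    return {
        {"WHAT-IS-XYZ", std::regex(R"(WHAT IS XYZ\?)")},
        {"WHAT-IS",     std::regex(R"(WHAT IS .*\?)")},
        {"HOW-MUCH",    std::regex(R"(HOW MUCH .*\?)")},
    };
}
```

On the input "WHAT IS XYZ?" both WHAT-IS-XYZ and WHAT-IS match all twelve characters, so the tie goes to whichever is listed first. Swap the first two rules and WHAT-IS-XYZ can never be selected, which is the situation flex is warning about.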

If you can meet flex's requirements, flex should give you a very fast scanner.


Check out CritBit trees:

Example source code that's trivial to C++-ise if you really feel the need.

To find all matches, use the function critbit0_allprefixed.

e.g.

// Find all strings that start with, or are equal to, "WHAT IS"
critbit0_allprefixed(tree, "WHAT IS", SomeCallback);

SomeCallback is called for each match.
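If you want to try the same kind of prefix query in plain C++ before pulling in the crit-bit code, a sorted std::set can stand in for it. To be clear, this is a different data structure (a balanced tree, not a crit-bit tree), and all_prefixed is a made-up name that only mirrors the idea of critbit0_allprefixed, not its actual signature.

```cpp
#include <functional>
#include <set>
#include <string>

// Calls `callback` for every key that starts with, or is equal to, `prefix`.
// In a sorted set those keys form one contiguous run that begins at
// lower_bound(prefix), so the loop stops at the first key without the prefix.
void all_prefixed(const std::set<std::string>& keys,
                  const std::string& prefix,
                  const std::function<void(const std::string&)>& callback) {
    for (auto it = keys.lower_bound(prefix);
         it != keys.end() && it->compare(0, prefix.size(), prefix) == 0;
         ++it) {
        callback(*it);
    }
}
```

With the keys "WHAT IS", "WHAT IS ABC?" and "WHO" in the set, all_prefixed(keys, "WHAT IS", cb) calls cb for the first two and skips "WHO".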