What is the runtime difference between different parsing algorithms?

I did a study of LR parser speed, comparing LRSTAR and YACC.

In 1989 I compared the matrix parser tables defined in the paper "Optimization Of Parser Tables For Portable Compilers" to the YACC parser tables (comb structure). Both are LR or LALR parser tables. I found that the matrix parser tables were usually about twice the speed of the comb parser tables. This is because the number of nonterminal transitions (goto actions) is usually about twice the number of terminal transitions, and the matrix tables make a nonterminal transition faster. However, a parser does many other things besides state transitions, so the transitions may not be the bottleneck.
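
To make the goto difference concrete, here is a minimal sketch of the two lookup styles in C++. The table names and sizes are hypothetical stand-ins, not LRSTAR's or YACC's actual data layouts:

```cpp
#include <cstdint>

constexpr int NUM_STATES   = 100;   // hypothetical sizes
constexpr int NUM_NONTERMS = 40;

// Matrix form: a dense 2-D array; a goto is a single indexed load.
int16_t goto_matrix[NUM_STATES][NUM_NONTERMS];

int matrix_goto(int state, int nonterm) {
    return goto_matrix[state][nonterm];          // one lookup, no branch
}

// Comb (packed) form, YACC-style: rows overlap in one packed vector,
// so each probe needs a base offset, a validity check, and a fallback.
int16_t comb_base[NUM_STATES];       // row offset for each state
int16_t comb_next[4096];             // packed transitions
int16_t comb_check[4096];            // which state owns each slot
int16_t comb_default[NUM_STATES];    // fallback when the check fails

int comb_goto(int state, int nonterm) {
    int slot = comb_base[state] + nonterm;
    if (comb_check[slot] == state)               // extra compare + branch
        return comb_next[slot];
    return comb_default[state];
}
```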

In 2009 I compared the matrix lexer tables to the flex-generated lexer tables and also to the direct-code lexers generated by re2c. I found that the matrix tables were about twice the speed of the flex-generated tables and almost as fast as the re2c lexer code. The benefit of the matrix tables is that they compile much more quickly than the direct-code tables, and they are smaller. Finally, if you allow the matrix tables to be very large (with no compression), they can actually be faster than the direct-code (re2c) tables. For a graph of the comparison, see the LRSTAR comparison page.
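
For reference, the inner loop of a matrix table-driven DFA scanner looks roughly like this. The tables here are hypothetical, and both flex's compressed tables and re2c's generated code differ in detail:

```cpp
#include <cstdint>

constexpr int NUM_STATES = 64;              // hypothetical DFA size

// Dense matrix: next state for every (state, input byte) pair; -1 means
// no transition. Uncompressed this is NUM_STATES * 256 entries, which is
// the space-for-speed trade mentioned above: a one-load inner loop.
int16_t next_state[NUM_STATES][256];
bool    accepting[NUM_STATES];

// Run the DFA from state 0 until it has no transition on the current
// byte; the caller tests accepting[state]. (Longest-match backup, which
// a production lexer needs, is omitted.)
int scan(const unsigned char*& p) {
    int state = 0, s;
    while ((s = next_state[state][*p]) >= 0) {  // one load + compare per byte
        state = s;
        ++p;
    }
    return state;
}
```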

Compiler front-ends (without preprocessing) built with LRSTAR process about 2,400,000 lines of code per second, and that includes building a symbol table and an abstract syntax tree while parsing and lexing. The DFA-based lexers process 30,000,000 tokens per second. Matrix table-driven DFA lexers have one more advantage: the lexer skeleton can be rewritten in assembly language. When I did this in 1986, the assembly version ran at twice the speed of the C version.

I don't have much experience with LL parser speed or recursive descent parser speed. Sorry. If ANTLR could generate C++ code, then I could do a speed test for its parsers.


LR parsers IMHO can be the fastest. Basically they use a token as an index into a lookahead set or a transition table to decide what to do next (push a state index, or pop state indexes and call a reduction routine). Converted to machine code, this can be just a few machine instructions. Pennello discusses this in detail in his paper:

Thomas J. Pennello: Very fast LR parsing. SIGPLAN Symposium on Compiler Construction 1986: 145-151
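
Here is a minimal sketch of that table-driven loop, with hypothetical action/goto tables (Pennello's technique goes further and compiles the tables directly into machine code):

```cpp
#include <vector>

// ERROR first, so zero-initialized table cells reject by default.
enum ActKind { ERROR, SHIFT, REDUCE, ACCEPT };
struct Action { ActKind kind; int arg; };   // arg: target state or rule number

constexpr int NUM_STATES = 128, NUM_TOKENS = 64,
              NUM_NONTERMS = 32, NUM_RULES = 50;

// Tables a parser generator would emit; left empty in this sketch.
Action action_table[NUM_STATES][NUM_TOKENS];
int    goto_table[NUM_STATES][NUM_NONTERMS];
int    rule_length[NUM_RULES];              // symbols on each rule's RHS
int    rule_lhs[NUM_RULES];                 // nonterminal each rule produces

bool parse(const int* tokens) {
    std::vector<int> stack{0};              // the LR state stack
    for (;;) {
        Action a = action_table[stack.back()][*tokens];
        switch (a.kind) {
        case SHIFT:                         // push the new state, consume token
            stack.push_back(a.arg);
            ++tokens;
            break;
        case REDUCE:                        // pop the RHS, transition on the LHS
            stack.resize(stack.size() - rule_length[a.arg]);
            stack.push_back(goto_table[stack.back()][rule_lhs[a.arg]]);
            break;
        case ACCEPT: return true;
        case ERROR:  return false;
        }
    }
}
```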

LL parsers involve recursive calls, which are a bit slower than just plain table lookups, but they can be pretty fast.
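
As a point of comparison, here is a self-contained recursive-descent sketch for a toy expression grammar; the grammar and function names are illustrative, not from any particular generator:

```cpp
#include <cctype>

// Each nonterminal of the toy grammar
//   E -> T ('+' T)*    T -> F ('*' F)*    F -> NUM | '(' E ')'
// becomes a function; the recursive calls are the per-step overhead
// mentioned above.
const char* p;                              // input cursor

int expr();                                 // forward declaration

int factor() {
    if (*p == '(') {                        // F -> '(' E ')'
        ++p;
        int v = expr();
        ++p;                                // skip ')'
        return v;
    }
    int v = 0;                              // F -> NUM
    while (std::isdigit((unsigned char)*p))
        v = v * 10 + (*p++ - '0');
    return v;
}

int term() {                                // T -> F ('*' F)*
    int v = factor();
    while (*p == '*') { ++p; v *= factor(); }
    return v;
}

int expr() {                                // E -> T ('+' T)*
    int v = term();
    while (*p == '+') { ++p; v += term(); }
    return v;
}

int main() {
    p = "2*(3+4)";
    return expr() == 14 ? 0 : 1;            // exit code 0 on success
}
```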

GLR parsers are generalizations of LR parsers, and thus have to be slower than LR parsers. The key observation is that most of the time a GLR parser is acting exactly as an LR parser would, and that part can be made to run at essentially the same speed as an LR parser, so GLR parsers can be fairly fast.
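
A very rough sketch of the fork-on-conflict idea, using copied stacks instead of the graph-structured stack a real GLR parser shares between forks (all table names are hypothetical, and epsilon-cycle handling is omitted):

```cpp
#include <utility>
#include <vector>

enum ActKind { SHIFT, REDUCE, ACCEPT };
struct Action { ActKind kind; int arg; };

// Hypothetical interface: every action valid in (state, lookahead).
// When each cell yields exactly one action, the loop below never forks
// and behaves like the plain LR loop -- the fast path mentioned above.
std::vector<Action> actions(int state, int token) {
    return {};                              // emitted by a generator; stubbed
}
int goto_table[128][32];
int rule_length[64], rule_lhs[64];

bool glr_parse(const int* tokens) {
    std::vector<std::vector<int>> stacks{{0}};  // one state stack per live parse
    while (!stacks.empty()) {
        std::vector<std::vector<int>> shifted;
        while (!stacks.empty()) {               // drain reduces for this token
            std::vector<int> st = stacks.back();
            stacks.pop_back();
            for (const Action& a : actions(st.back(), *tokens)) {
                if (a.kind == ACCEPT) return true;
                std::vector<int> fork = st;     // copy the stack on a conflict
                if (a.kind == REDUCE) {         // pop RHS, goto on LHS, retry
                    fork.resize(fork.size() - rule_length[a.arg]);
                    fork.push_back(goto_table[fork.back()][rule_lhs[a.arg]]);
                    stacks.push_back(fork);
                } else {                        // SHIFT: park until next token
                    fork.push_back(a.arg);
                    shifted.push_back(fork);
                }
            }
        }
        stacks = std::move(shifted);
        ++tokens;
    }
    return false;                               // every candidate parse died
}
```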

Your parser is likely to spend more time breaking the input stream into tokens than executing the parsing algorithm, so these differences may not matter a lot.

In terms of getting your grammar into a usable form, the following is the order in which the parsing technologies "make it easy":

  • GLR (really easy: if you can write grammar rules, you can parse)
  • LR(k) (many grammars fit, extremely few parser generators)
  • LR(1) (most commonly available [YACC, Bison, Gold, ...])
  • LL (usually requires significant reengineering of the grammar to remove left recursion)
  • Hand-coded recursive descent (easy to code for simple grammars; difficult to handle complex grammars and difficult to maintain if the grammar changes a lot)