Why is there no decent SQL parser?

When a manufacturer claims to support a language X, he means "something like the X standard" but not the standard itself. For historical reasons, manufacturers often implemented language X before the standard was a standard, so they start on the wrong foot; trying to make their version match the standard usually breaks their large base of user code; and they always want to add their own goodies to lock in their users.

This is true for SQL, C, C++... The only language I know of where people try really hard to match the standard is Ada, and even it comes in multiple dialects. (Look at what browsers accept!)

So you can't expect an off-the-shelf generic SQL parser to parse PL/SQL. You really have to have a PL/SQL parser. And these are hard to build: the documentation is poor, Oracle has no reason to fix it, and it certainly has no motivation to help the grammar builder.

My company (Semantic Designs) has a PL/SQL parser that covers 10g pretty well (Oracle's documentation is poor... we keep finding variations from the reference documents) and does most of 11g. We've run it across millions of lines of PL/SQL code.


Good parsers are hard to write. That starts with the parser generator, the tool that produces the parser code (it usually eats some (E)BNF-like syntax, which has its own limitations).

Error handling in parsers is a research topic of its own. It is not only about detecting errors but also about giving useful information on what could be wrong and how to fix it. Some parsers don't even offer location information ("error happened at line/column").
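To illustrate the location-information point, here is a minimal sketch of a tokenizer that tracks line and column and reports them when it hits something it can't handle. The names (`SqlSyntaxError`, `tokenize`) and the tiny token set are made up for this example; a real SQL lexer needs far more than this.

```python
import re

class SqlSyntaxError(Exception):
    """Hypothetical error type that carries a source location."""
    def __init__(self, message, line, column):
        super().__init__(f"{message} at line {line}, column {column}")
        self.line = line
        self.column = column

# Deliberately tiny token set (an assumption for this sketch).
TOKEN_RE = re.compile(r"""
    (?P<ws>\s+)
  | (?P<word>[A-Za-z_][A-Za-z0-9_]*)
  | (?P<number>\d+)
  | (?P<punct>[(),.*=;])
""", re.VERBOSE)

def tokenize(text):
    """Return (kind, value, line, column) tuples; columns are 1-based."""
    tokens = []
    pos, line, line_start = 0, 1, 0
    while pos < len(text):
        match = TOKEN_RE.match(text, pos)
        if match is None:
            # Report where the problem is, not just that there is one.
            raise SqlSyntaxError(
                f"unexpected character {text[pos]!r}",
                line, pos - line_start + 1)
        if match.lastgroup == "ws":
            # Keep the line counter in sync with every newline we skip.
            for offset, ch in enumerate(match.group(), start=pos):
                if ch == "\n":
                    line += 1
                    line_start = offset + 1
        else:
            tokens.append((match.lastgroup, match.group(),
                           line, pos - line_start + 1))
        pos = match.end()
    return tokens
```

Calling `tokenize("SELECT a\nFROM t WHERE a = ?")` raises `SqlSyntaxError: unexpected character '?' at line 2, column 18`. Saying what the user probably meant and how to fix it is the hard part, and this sketch doesn't attempt it.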

Next, you have SQL, which means "Structured Query Language", not "Standard Query Language". There is a SQL standard, even several, but you won't find a single database which fully implements any of them.

Oracle grudgingly offers VARCHAR but you had better use VARCHAR2. Some databases offer recursive/tree-like queries, and each of them uses its own special syntax for this. Joining is defined pretty clearly in the standard (join, left join, ...) but why bother if you can use Oracle's (+) operator?
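A quick way to see the dialect problem is to feed both join styles to some other engine's parser. The sketch below uses SQLite through Python's standard `sqlite3` module purely as a stand-in for "a parser that isn't Oracle's"; the `emp`/`dept` tables and the `parses` helper are invented for the example.

```python
import sqlite3

def parses(conn, sql):
    """Return True if this engine accepts the statement (hypothetical helper)."""
    try:
        conn.execute(sql)
        return True
    except sqlite3.OperationalError:
        # sqlite3 reports syntax errors as OperationalError.
        return False

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE emp (id INTEGER, dept_id INTEGER)")
conn.execute("CREATE TABLE dept (id INTEGER, name TEXT)")

# Standard outer join syntax.
standard_join = (
    "SELECT e.id, d.name FROM emp e "
    "LEFT JOIN dept d ON e.dept_id = d.id"
)

# The same query in Oracle's older (+) outer join notation.
oracle_join = (
    "SELECT e.id, d.name FROM emp e, dept d "
    "WHERE e.dept_id = d.id (+)"
)

print(parses(conn, standard_join))
print(parses(conn, oracle_join))
```

SQLite accepts the first statement and rejects the second as a syntax error, even though both are everyday SQL to an Oracle developer. Multiply that by every vendor-specific construct and you get a feel for the size of the problem.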

On top of that, for every database version, new features are added to the grammar.

So while you could write a parser that can read the standard cases, writing a parser that supports all the features offered by all the databases around the globe is nearly impossible. And I'm not even talking about the bugs you can encounter in these parsers.

One solution would be for all database vendors to publish their grammar files. But these are crown jewels (IP). So you should be happy that you can use the databases without having to pay a license fee per parsed character * number of CPUs.