Writing an HTML Parser

Now that the HTML5 standard exists, writing an HTML parser no longer requires trial and error or arcane knowledge.

Instead you just have to implement the standardized parsing algorithm.


The looseness of HTML can be accommodated by figuring out the missing open and close tags as needed. This is essentially what a validator like tidy does.

You'll keep a stack (perhaps implicitly with a tree) of the current context. For example, {<html>, <body>} means you're currently in the body of the html document. When you encounter a new node, you compare the requirements for that node to what's currently on the stack.

Suppose your stack is currently just {<html>}. You encounter a <p> tag. You look up <p> in a table that tells you a paragraph must be inside the <body>. Since you're not in the body, you implicitly push <body> onto your stack (or add a body node to your tree). Then you can put the <p> into the tree.

Now suppose you see another <p>. Your rules tell you that you cannot nest a paragraph within a paragraph, so you know you have to pop the current <p> off the stack (as though you had seen a close tag) before pushing the new paragraph onto the stack.

At the end of your document, you pop each remaining element off your stack, as though you had seen a close tag for each one.
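The stack walk described above can be sketched in a few lines of Python. This is only an illustration: the two rule tables (`REQUIRED_PARENT`, `NO_SELF_NESTING`) and the function name `build_events` are made up for the example and cover just the <html>/<body>/<p> cases from the text, not the real HTML rules.

```python
# Hypothetical rule tables; a real parser would take these from the HTML spec.
REQUIRED_PARENT = {"body": "html", "p": "body"}  # element -> ancestor that must be open
NO_SELF_NESTING = {"p"}                          # elements that cannot contain themselves


def build_events(start_tags):
    """Turn a sequence of start-tag names into explicit open/close events."""
    stack, events = [], []

    def open_tag(name):
        parent = REQUIRED_PARENT.get(name)
        if parent is not None and parent not in stack:
            open_tag(parent)  # implicitly open the missing ancestor first
        stack.append(name)
        events.append(("open", name))

    def close_top():
        events.append(("close", stack.pop()))

    for name in start_tags:
        if name in NO_SELF_NESTING and name in stack:
            # Pop as though a close tag had been seen for the open element.
            while stack[-1] != name:
                close_top()
            close_top()
        open_tag(name)

    # End of document: close everything that is still open.
    while stack:
        close_top()
    return events


events = build_events(["html", "p", "p"])
```

For the input `["html", "p", "p"]`, the <body> is opened implicitly before the first <p>, the first <p> is closed when the second one arrives, and the remaining elements are closed at the end.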

The trick is to find a good way to represent the context requirements for each element.


So, I'll try for an answer here:

Basically, what makes "plain" HTML parsing (not talking about valid XHTML here) different from XML parsing is loads of rules, like <img> tags that are never closed, and, strictly speaking, the fact that even the sloppiest HTML markup will still render somehow in a browser. You will need a validator along with the parser to build your tree. But you'll have to decide which HTML standard you want to support, so that when you come across a flaw in the markup, you'll know it's an error and not just sloppy HTML.

Know all the rules, build a validator, and then you'll be able to build a parser. That's Plan A.

Plan B would be to build a certain error tolerance into your parser, which would make the validation step unnecessary. For example, parse all the tags and put them in a list, omitting any attributes, so that you can easily operate on the list and determine whether a tag was left open or was never opened at all. That eventually gets you a "good" layout tree, which is an approximate solution for sloppy markup and exact for correct markup.
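A rough sketch of the Plan B bookkeeping, assuming a simple regular expression is good enough to pull out tag names (it is not for real-world pages: attribute values containing ">", comments, and <script> content would all confuse it). The names `list_tags`, `find_problems`, and the small `VOID` set are invented for this example.

```python
import re

# Matches an opening or closing tag and captures the slash and the tag name.
TAG_RE = re.compile(r"<(/?)([a-zA-Z][a-zA-Z0-9]*)[^>]*>")
# A small, incomplete subset of elements that never take a close tag.
VOID = {"img", "br", "hr", "input", "meta", "link"}


def list_tags(html):
    """Return (name, is_close) pairs for every tag, with attributes dropped."""
    return [(m.group(2).lower(), m.group(1) == "/") for m in TAG_RE.finditer(html)]


def find_problems(html):
    """Return (left_open, never_opened) tag names found in the markup."""
    open_tags, left_open, never_opened = [], [], []
    for name, is_close in list_tags(html):
        if name in VOID:
            continue  # nothing to match for void elements
        if not is_close:
            open_tags.append(name)
        elif name in open_tags:
            # Everything opened after the matching tag was left open.
            while True:
                top = open_tags.pop()
                if top == name:
                    break
                left_open.append(top)
        else:
            never_opened.append(name)
    left_open.extend(reversed(open_tags))  # still open at end of document
    return left_open, never_opened


left_open, never_opened = find_problems(
    '<div class="a"><p>text<img src="x.png"></div></span>'
)
```

Here the <p> is reported as left open and the </span> as never opened, which is the information you need to patch up the list before building the tree.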

hope that helped!


Harsh. Go

HTML is not XML. XHTML is XML. Most websites are HTML; some are XHTML. In XHTML all tags must be closed (or have no body, which is still closed).

If you want to write an HTML parser as a learning experiment, then go for it. If you want to write the next "Greaterest HTML parserer" then give it up. Apache (or somebody else) wins; the important information is: you don't know more than the large groups that specialize in parsing HTML.

To answer the question "How do I deal with this?": read the W3C spec on HTML. It answers your question. If your response is "but I don't want to", then you are actually saying "I'm a lazy goofrocket who wants to pretend to learn". If that is the case, I suggest you delete the post and move on; the Microsoft IE team probably has some documents that will interest you.

Less harsh answer

HTML is not easy to parse. At its loosest, you don't need head or body elements, and a lot of tags do not need to be closed. A basic rule when parsing HTML is: if you encounter a new block element, automatically close the previous block element. You cannot use a standard XML parser for this because HTML is not XML.

Similar to XML, you will need to split your document into elements, including free text elements.
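If you only want to see what that split looks like, Python's standard-library `html.parser.HTMLParser` already does the tokenizing step. The sketch below collects start tags, end tags, and free text into one list; the class name `TokenCollector` and the helper `tokenize` are invented for the example, and whitespace-only text is dropped as a simplification.

```python
from html.parser import HTMLParser


class TokenCollector(HTMLParser):
    """Collects start tags, end tags, and free text in document order."""

    def __init__(self):
        super().__init__()
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        self.tokens.append(("start", tag))

    def handle_endtag(self, tag):
        self.tokens.append(("end", tag))

    def handle_data(self, data):
        if data.strip():  # skip whitespace-only text (a simplification)
            self.tokens.append(("text", data))


def tokenize(html):
    collector = TokenCollector()
    collector.feed(html)
    collector.close()  # flush any buffered text
    return collector.tokens


tokens = tokenize("<p>Hello <b>world</b>")
```

Note that the tokenizer does not complain about the missing </p>; deciding where the paragraph ends is left to whatever builds the tree from these tokens.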

XHTML is much easier because it must be well formed XML. You can use an XML parser for this.
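A quick way to see the difference is to feed both kinds of markup to an XML parser. The sketch below uses Python's standard-library `xml.etree.ElementTree`; the helper name `parses_as_xml` and the two sample strings are made up for the example.

```python
import xml.etree.ElementTree as ET


def parses_as_xml(markup):
    """Return True if the markup is well-formed XML, False otherwise."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


# Well-formed XHTML-style markup: every element is closed.
xhtml_ok = parses_as_xml("<html><body><p>one</p><br /></body></html>")

# Loose HTML: unclosed <p> and <br>, which an XML parser rejects.
html_ok = parses_as_xml("<html><body><p>one<p>two<br></body></html>")
```

The first string parses; the second raises a parse error inside the helper, so it comes back as not well-formed even though a browser would render it without complaint.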