Options for HTML scraping?

The Ruby world's equivalent to Beautiful Soup is why_the_lucky_stiff's Hpricot.


In the .NET world, I recommend the HTML Agility Pack. Not near as simple as some of the above options (like HTMLSQL), but it's very flexible. It lets you maniuplate poorly formed HTML as if it were well formed XML, so you can use XPATH or just itereate over nodes.

http://www.codeplex.com/htmlagilitypack


BeautifulSoup is a great way to go for HTML scraping. My previous job had me doing a lot of scraping and I wish I knew about BeautifulSoup when I started. It's like the DOM with a lot more useful options and is a lot more pythonic. If you want to try Ruby they ported BeautifulSoup calling it RubyfulSoup but it hasn't been updated in a while.

Other useful tools are HTMLParser or sgmllib.SGMLParser which are part of the standard Python library. These work by calling methods every time you enter/exit a tag and encounter html text. They're like Expat if you're familiar with that. These libraries are especially useful if you are going to parse very large files and creating a DOM tree would be long and expensive.

Regular expressions aren't very necessary. BeautifulSoup handles regular expressions so if you need their power you can utilize it there. I say go with BeautifulSoup unless you need speed and a smaller memory footprint. If you find a better HTML parser on Python, let me know.