Jsoup-like HTML parser for C++

If you are familiar with the Qt framework, the most convenient way is to use QWebElement from the Qt WebKit module (reference here).

Otherwise, as another post suggests, a good option is to use Tidy to convert the HTML into valid XML and then read it with an XML parser such as libxml++. Sample code showing these two steps is available here.


Chromium has an open-source parser. Also, Google's gumbo-parser (an HTML5 parser written in C) looks cool.


Unfortunately, I guess there's no parser like Jsoup for C++ ...

Besides the libraries already mentioned here, there's a good overview of C++ (and some C) parsers here: Free C or C++ XML Parser Libraries

I used TinyXML-2 for (HTML) DOM parsing; it's a very small library (only two files) that runs on most operating systems, even non-desktop ones. Since it is an XML parser, it expects well-formed markup.

LibXml

  • push and pull parser (DOM, SAX)
  • Validation
  • XPath and XPointer support
  • Cross-platform / good documentation

Apache Xerces

  • push and pull parser (DOM, SAX)
  • Validation
  • No XPath support (but maybe there is a separate package for this?)
  • Cross-platform / good documentation

If you are on C++/CLI, check out NSoup, a Jsoup port for .NET.

Some more:

  • htmlcxx - HTML and CSS APIs for C++
  • MSHTML (?)
  • pugixml (DOM / XPath and Unicode support)
  • LibCSS (CSS Parser) / LibDOM (DOM) (however, both in C)
  • hcxselect (CSS selector engine for C++)

Maybe you can combine a DOM model/parser with a CSS selector engine to get something close to Jsoup?
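
To give a rough idea of what such a combination could look like, here is a minimal sketch in plain C++14 with no external library. Everything in it is hypothetical and only for illustration: the `Node` struct stands in for whatever DOM type your parser gives you, and `parse_selector`, `matches`, `select_all` and `build_sample_tree` are made-up names, not part of htmlcxx, pugixml, hcxselect or any other library listed above. It only handles compound selectors made of a tag, an `#id` and `.class` parts (for example `a.external` or `div#main`), with no combinators or attribute selectors.

```cpp
#include <algorithm>
#include <cassert>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical minimal DOM node; a real parser exposes its own node type.
struct Node {
    std::string tag;
    std::string id;
    std::vector<std::string> classes;
    std::string text;
    std::vector<std::unique_ptr<Node>> children;
};

// Parsed form of a compound selector such as "a.external" or "div#main".
struct SimpleSelector {
    std::string tag;                   // empty means "any tag"
    std::string id;                    // empty means "no id constraint"
    std::vector<std::string> classes;  // every listed class must be present
};

// Split "tag#id.class1.class2" into its parts.
SimpleSelector parse_selector(const std::string& selector) {
    assert(!selector.empty() && "selector must not be empty");
    SimpleSelector sel;
    char mode = 't';  // 't' = tag, '#' = id, '.' = class
    for (char c : selector) {
        if (c == '#' || c == '.') {
            mode = c;
            if (c == '.') sel.classes.emplace_back();
        } else if (mode == '#') {
            sel.id.push_back(c);
        } else if (mode == '.') {
            sel.classes.back().push_back(c);
        } else {
            sel.tag.push_back(c);
        }
    }
    if (sel.tag == "*") sel.tag.clear();  // universal selector
    return sel;
}

// True if the node satisfies every constraint of the selector.
bool matches(const Node& node, const SimpleSelector& sel) {
    if (!sel.tag.empty() && node.tag != sel.tag) return false;
    if (!sel.id.empty() && node.id != sel.id) return false;
    for (const auto& cls : sel.classes) {
        if (std::find(node.classes.begin(), node.classes.end(), cls) ==
            node.classes.end()) {
            return false;
        }
    }
    return true;
}

// Depth-first walk that collects matching nodes in document order.
void collect(const Node& node, const SimpleSelector& sel,
             std::vector<const Node*>& out) {
    if (matches(node, sel)) out.push_back(&node);
    for (const auto& child : node.children) collect(*child, sel, out);
}

// Returned pointers stay valid only as long as the tree itself is alive.
std::vector<const Node*> select_all(const Node& root,
                                    const std::string& selector) {
    std::vector<const Node*> out;
    collect(root, parse_selector(selector), out);
    return out;
}

std::unique_ptr<Node> make_node(std::string tag, std::string id,
                                std::vector<std::string> classes,
                                std::string text) {
    auto node = std::make_unique<Node>();
    node->tag = std::move(tag);
    node->id = std::move(id);
    node->classes = std::move(classes);
    node->text = std::move(text);
    return node;
}

// Hand-built stand-in for what a parser would produce from:
// <div id="main"><a class="external link">Example</a><a>Home</a>
// <span class="external">note</span></div>
std::unique_ptr<Node> build_sample_tree() {
    auto root = make_node("div", "main", {}, "");
    root->children.push_back(make_node("a", "", {"external", "link"}, "Example"));
    root->children.push_back(make_node("a", "", {}, "Home"));
    root->children.push_back(make_node("span", "", {"external"}, "note"));
    return root;
}
```

With the sample tree, `select_all(*root, "a.external")` returns only the "Example" link, while `select_all(*root, ".external")` also returns the span. In practice you would fill the tree (or adapt `matches`) from the nodes of a real parser, and a ready-made engine such as hcxselect is probably the better choice once you need descendant or attribute selectors.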