Command-line CSS selector tool

Use the W3C html-xml-utils (package html-xml-utils on Debian/Ubuntu) for HTML/XML parsing and for extracting content with CSS selectors. For example:

hxnormalize -l 240 -x filename.html | hxselect -s '\n' -c "td.data"

This produces the desired output:

Tabular Content 1
Tabular Content 2

The -l 240 option sets a long output line length, so elements with long content are not wrapped across multiple lines. The -x option makes hxnormalize emit well-formed XML, which is what hxselect expects as input.
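
For reference, the examples here assume markup roughly like the following (a hypothetical filename.html reconstructed from the selectors and output shown, not the original question's input):

<!-- hypothetical sample input, reconstructed for illustration -->
<html>
  <body>
    <div class="content">
      <table>
        <tbody>
          <tr>
            <td class="data">Tabular Content 1</td>
            <td class="data">Tabular Content 2</td>
          </tr>
        </tbody>
      </table>
    </div>
  </body>
</html>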


CSS Solution

The Element Finder command-line tool (elfinder) can partially accomplish this task:

  • https://github.com/keeganstreet/element-finder
  • http://keegan.st/2012/06/03/find-in-files-with-css-selectors/

For example:

elfinder -j -s td.data -x "html"

This outputs the results in JSON format, from which the matched content can then be extracted.
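
For instance, the JSON can be pretty-printed for inspection with jq (assuming it is installed); the selector and extension arguments are the same as above:

elfinder -j -s td.data -x "html" | jq .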

XML Solution

The XML::Twig module ("sudo apt-get install xml-twig-tools") comes with a tool named xml_grep that is able to do just that, provided that your HTML is well-formed, of course.

I'm sorry I'm not able to test this at the moment, but something like this should work:

xml_grep -t '*/div[@class="content"]/table/tbody/tr/td[@class="data"]' file.html
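
If your HTML is not well-formed, one option is to normalize it first with hxnormalize from the first answer and feed the result to xml_grep. This is an untested sketch that relies on bash/zsh process substitution so that xml_grep still receives a file argument:

# Normalize to well-formed XML first, then query it (untested sketch)
xml_grep -t '*/div[@class="content"]/table/tbody/tr/td[@class="data"]' <(hxnormalize -x file.html)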

pup Solution

pup (https://github.com/ericchiang/pup) has a CSS-based query language that closely matches your example. In fact, with your input, the following command:

pup "body > div.content > table > tbody > tr > td.data text{}"

produces:

Tabular Content 1
Tabular Content 2

The trailing text{} display function prints just the text of the matched nodes, stripping the HTML tags.
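
pup also provides a json{} display function, which emits the matched elements as JSON instead of plain text and is handy for further scripting; for example, with the same hypothetical input:

pup 'td.data json{}' < input.html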

One nice feature is that the full path need not be given; again with your example:

$ pup 'td.data text{}' < input.html
Tabular Content 1
Tabular Content 2

One advantage of pup is that it uses the golang.org/x/net/html package, which implements the HTML5 parsing algorithm, so it is forgiving of real-world markup that is not well-formed XML.