Is there something like a "CSS selector" or XPath grep?

There are two tools:

  • pup - Inspired by jq, pup aims to be a fast and flexible way of exploring HTML from the terminal.

  • htmlq - Like jq, but for HTML. Uses CSS selectors to extract bits of content from HTML files.

Examples:

$ wget http://en.wikipedia.org/wiki/Robots_exclusion_standard -O robots.html

$ pup --color 'title' < robots.html
<title>
 Robots exclusion standard - Wikipedia
</title>

$ htmlq --text 'title' < robots.html
Robots exclusion standard - Wikipedia

I have built a command-line tool with Node.js that does just this. You enter a CSS selector, and it searches all of the HTML files in the directory and tells you which files contain matches for that selector.

You will need to install Element Finder, cd into the directory you want to search, and then run:

elfinder -s "div.a ul.b"

For more info please see http://keegan.st/2012/06/03/find-in-files-with-css-selectors/


Try this:

  1. Install http://www.w3.org/Tools/HTML-XML-utils/.
    • Ubuntu: aptitude install html-xml-utils
    • macOS: brew install html-xml-utils
  2. Save a web page (call it filename.html).
  3. Run: hxnormalize -l 240 -x filename.html | hxselect -s '\n' -c "label.black"

Here "label.black" is the CSS selector that identifies the HTML element to extract. To make this reusable, write a helper script named cssgrep:

#!/bin/bash

# Ignore errors, write the results to standard output.
hxnormalize -l 240 -x "$1" 2>/dev/null | hxselect -s '\n' -c "$2"

You can then run:

cssgrep filename.html "label.black"

This prints the content of every HTML label element with the class black, one result per line.

The -l 240 argument matters because hxnormalize wraps its output at a maximum line length, and a short limit can leave line breaks inside the text you want to extract. For example, if the input is <label class="black">Text to \nextract</label>, then with -l 240 the HTML is reformatted to <label class="black">Text to extract</label> on a single line, and lines are only wrapped at column 240, which simplifies parsing. A larger value such as 1024 or beyond also works.
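
To get closer to grep -l (list the files that match rather than print the matched content), the same pipeline can be wrapped in a loop. The following is only a sketch: the function name cssgrep_files is made up for illustration, and it assumes that hxselect prints nothing when the selector matches no element, so any non-empty output is treated as a match.

```shell
# cssgrep_files (hypothetical name): print each file that contains at least
# one element matching a CSS selector.
# Usage: cssgrep_files SELECTOR FILE...
cssgrep_files() {
  local selector="$1"
  shift
  local file
  for file in "$@"; do
    # No -c here: we want the matched element itself, so that an empty
    # element such as <span class="x"></span> still counts as a match.
    # Assumption: hxselect produces no output when nothing matches.
    if [ -n "$(hxnormalize -x "$file" 2>/dev/null | hxselect "$selector" 2>/dev/null)" ]; then
      printf '%s\n' "$file"
    fi
  done
}
```

After sourcing the function, something like cssgrep_files "label.black" *.html would list the matching files in the current directory. Errors from both tools are discarded, so a file that cannot be parsed is silently reported as having no match.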

See also:

  • https://superuser.com/a/529024/9067 - similar question
  • https://gist.github.com/Boldewyn/4473790 - wrapper script