How to parse hundreds of HTML source files in shell?

The html-xml-utils package, available in most major Linux distributions, has a number of tools that are useful when dealing with HTML and XML documents. Particularly useful for your case is hxselect, which reads from standard input and extracts elements based on CSS selectors. Your use case would look like:

hxselect '#the_div_id' <file

You might get a complaint about the input not being well formed, depending on what you are feeding it. This complaint is given over standard error and thus can easily be suppressed if needed (e.g., with 2>/dev/null). An alternative to this would be to use Perl's HTML::Parser package; however, I will leave that to someone with Perl skills less rusty than my own.
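
Since you have hundreds of files, a shell loop can apply hxselect to each one in turn. This is a minimal sketch, assuming the files match *.html in the current directory and that one output file per input is wanted (the *.div naming is just for illustration):

for f in *.html; do
    # Extract the element, silencing "not well-formed" warnings on stderr.
    hxselect '#the_div_id' < "$f" 2>/dev/null > "${f%.html}.div"
done

If hxselect rejects some input outright, piping each file through hxnormalize -x (from the same package) first usually turns it into well-formed XML that hxselect accepts.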


Try pup, a command-line tool for processing HTML. For example:

pup '#the_div_id' < file.html
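
pup also reads from standard input, so the same kind of loop covers all the files; a sketch, again assuming *.html files in the current directory, with each file's output labelled so you can tell them apart:

for f in *.html; do
    echo "== $f =="              # header line naming the source file
    pup '#the_div_id' < "$f"
done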

Here's an untested Perl script that extracts <div id="the_div_id"> elements and their contents using HTML::TreeBuilder.

#!/usr/bin/env perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Process every file named on the command line.
foreach my $file_name (@ARGV) {
    my $tree = HTML::TreeBuilder->new;
    $tree->parse_file($file_name);
    # Find every <div id="the_div_id"> in the parsed document.
    for my $subtree ($tree->look_down(_tag => "div", id => "the_div_id")) {
        my $html = $subtree->as_HTML;
        $html =~ s/(?<!\n)\z/\n/;    # make sure the output ends with a newline
        print $html;
    }
    $tree = $tree->delete;    # free the parse tree's memory explicitly
}
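
Since the script already iterates over @ARGV, running it against hundreds of files needs nothing more than a shell glob; assuming it is saved as extract_div.pl (the name is hypothetical):

perl extract_div.pl *.html > all_divs.html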

If you're allergic to Perl, Python has HTMLParser in its standard library (html.parser in Python 3).

P.S. Do not try using regular expressions to parse HTML.