Text between two tags

If you only want the ... part of every <tr>...</tr> pair (each pair on a single line), do:

grep -o '<tr>.*</tr>' HTMLFILE | sed 's/\(<tr>\|<\/tr>\)//g' > NEWFILE

For rows that span several lines, do:

tr "\n" "|" < HTMLFILE | grep -o '<tr>.*</tr>' | sed 's/\(<tr>\|<\/tr>\)//g;s/|/\n/g' > NEWFILE

First check whether HTMLFILE contains the character "|" (unusual, but possible). If it does, use a different placeholder character that does not occur in the file.
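If picking a safe placeholder character is a nuisance, the same regex idea can be sketched in Python, where a pattern can span newlines directly. This is only a minimal sketch, not a replacement for a parser: the names TR_PATTERN and extract_tr_contents are made up for illustration, and it assumes the rows are not nested (no table inside a table).

```python
import re

# Non-greedy "(.*?)" stops at the first closing tag, so two rows on one
# line are kept apart. DOTALL lets "." cross newlines, which removes the
# need for a placeholder such as "|". "[^>]*" tolerates attributes.
TR_PATTERN = re.compile(r"<tr\b[^>]*>(.*?)</tr>", re.DOTALL | re.IGNORECASE)


def extract_tr_contents(html_text):
    """Return the raw text between each <tr ...> and the next </tr>."""
    return TR_PATTERN.findall(html_text)
```

Calling extract_tr_contents(open("HTMLFILE").read()) gives a list with one string per row, which can then be written to NEWFILE. Like the grep version, this is still a regular expression applied to HTML, so it will break on nested tables or on a literal "</tr>" inside a comment or script.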


You do have a requirement that warrants an HTML parser: you need to parse HTML. Perl's HTML::TreeBuilder, Python's BeautifulSoup and others are easy to use, easier than writing complex and brittle regular expressions.

perl -MHTML::TreeBuilder -le '
    $html = HTML::TreeBuilder->new_from_file($ARGV[0]) or die $!;
    foreach ($html->look_down(_tag => "tr")) {
        print map {$_->as_HTML()} $_->content_list();
    }
' input.html

or

python3 -c 'if True:
    import sys
    from bs4 import BeautifulSoup
    html = BeautifulSoup(open(sys.argv[1]).read(), "html.parser")
    for tr in html.find_all("tr"):
        # decode_contents() returns the inner markup of the row as a string
        print(tr.decode_contents())
' input.html
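If installing a third-party module is not an option, roughly the same thing can be sketched with Python's standard-library html.parser. The class name TrContentParser and the helper extract_rows below are hypothetical, chosen for illustration. The sketch assumes rows are not nested, and it silently drops comments and declarations found inside a row.

```python
from html.parser import HTMLParser


class TrContentParser(HTMLParser):
    """Collect the markup found inside each <tr> element."""

    def __init__(self):
        # convert_charrefs=False keeps entities such as &amp; unchanged.
        super().__init__(convert_charrefs=False)
        self.rows = []      # finished rows, one string per <tr>
        self._parts = None  # pieces of the current row, None outside a row

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._parts = []
        elif self._parts is not None:
            # Reuse the start tag exactly as it appeared in the source.
            self._parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are copied once, as written.
        if self._parts is not None:
            self._parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self._parts is None:
            return
        if tag == "tr":
            self.rows.append("".join(self._parts))
            self._parts = None
        else:
            self._parts.append("</%s>" % tag)

    def handle_data(self, data):
        if self._parts is not None:
            self._parts.append(data)

    def handle_entityref(self, name):
        if self._parts is not None:
            self._parts.append("&%s;" % name)

    def handle_charref(self, name):
        if self._parts is not None:
            self._parts.append("&#%s;" % name)


def extract_rows(html_text):
    """Return the inner markup of every <tr> in html_text."""
    parser = TrContentParser()
    parser.feed(html_text)
    parser.close()
    return parser.rows
```

For example, printing each item of extract_rows(open("input.html").read()) produces one row per item. The parser only tokenizes and does not repair broken markup, so for messy real-world pages BeautifulSoup or HTML::TreeBuilder remain the more forgiving choice.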

sed and awk are not well suited to this task; you should use a proper HTML parser instead. For example, hxselect from the W3C's html-xml-utils:

<htmlfile hxselect -s '\n' -c 'tr'