HTML to UNFORMATTED plain text?

Use w3m -dump <page.html>.

It will give you a plain-text rendering of the HTML file.

From the man page:

-dump  dump formatted page into stdout

Although it says formatted, the output is just plain text.
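If w3m is not installed, a rough stand-in can be sketched with Python's standard library alone. This is only a minimal sketch and not what w3m does internally: the names `TextExtractor` and `html_to_text` are made up for illustration, the set of block-level tags is an assumption, and tables, lists and links are not laid out the way a real text browser would render them.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""

    SKIP = {"script", "style"}
    # Assumed set of tags that should start a new line in the output.
    BLOCK = {"p", "div", "br", "li", "tr", "h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1
        elif tag in self.BLOCK:
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth:
            self.skip_depth -= 1
        elif tag in self.BLOCK:
            self.parts.append("\n")

    def handle_data(self, data):
        # Keep text only when we are not inside <script> or <style>.
        if not self.skip_depth:
            self.parts.append(data)


def html_to_text(html):
    """Return the visible text of an HTML string, one block per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    raw = "".join(parser.parts)
    # Collapse runs of whitespace within each line and drop empty lines.
    lines = (" ".join(line.split()) for line in raw.splitlines())
    return "\n".join(line for line in lines if line)


if __name__ == "__main__":
    page = "<h1>Title</h1><p>Hello <b>world</b></p><script>x = 1</script>"
    print(html_to_text(page))
```

For anything beyond simple pages, w3m -dump will give a much more faithful result.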


html2text is a Python script that converts a page of HTML into equivalent Markdown-structured text. It can be downloaded and run on any operating system that has Python installed. The html2text program is also in the repositories of many Linux distributions, and it can be run from the command line like this:

html2text -style pretty input.html

This command not only converts the original HTML file to text, it also does a pretty good job of making the plain-text output easy to read. The headings look like headings, the lists look like lists, etc.

If you're having trouble automatically converting tables from webpages to unformatted text, a modern Markdown editor such as Typora or Mark Text (both GUI applications for Windows/Mac/Linux) handles this easily. Of the two, Mark Text is better at accurately capturing everything on a webpage, while Typora has the more user-friendly editor, so I use both: Mark Text as a webpage grabber, and Typora to edit the Markdown text after I copy and paste it over.


As mentioned by Gombai Sándor, in a comment to NZD's answer:

lynx -dump -nolist -nomargins

When run from the command line with a URL, it writes the output to stdout. This seems to work very well. -nomargins may not be supported if one only has access to an older version of lynx (e.g. Lynx Version 2.8.5rel.5 (29 Oct 2005) on an old UNIX).

The output appears quite free of markup and links, with some potential exceptions (the following list may not be typical or exhaustive):

  • Extra white space does occur in tabular data. It is usually helpful for extracting the table, but it is occasionally inconsistent in ways that complicate parsing.
  • While link targets are not dumped, the visible link text may still be output. For example, footnote references may render as asterisks, or, on a wiki, clickable items may render as the equivalent plain text (without the underlying URL).
  • Some references may expand and output the alternate text.
  • Unordered lists dump with asterisks and indentation.
  • Ordered lists dump with numbers and indentation.
  • Input fields may appear as underscores.
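If those leftovers get in the way, the dump can be post-processed. The following is only a sketch under assumptions: the function name `tidy_dump` is hypothetical, and the patterns simply follow the behaviour described in the list above (leading `*` or `1.` list markers, runs of underscores for input fields, runs of spaces between table columns). Real lynx output may need different patterns.

```python
import re


def tidy_dump(text):
    """Clean up text captured from `lynx -dump -nolist` (assumed format)."""
    cleaned = []
    for line in text.splitlines():
        line = line.strip()
        # Drop a leading bullet ("*") or ordered-list number ("1.").
        line = re.sub(r"^(\*|\d+\.)\s+", "", line)
        # Remove runs of underscores left behind by input fields.
        line = re.sub(r"_{3,}", "", line).strip()
        # Treat two or more spaces as a column gap and turn it into a tab.
        line = re.sub(r"\s{2,}", "\t", line)
        if line:
            cleaned.append(line)
    return "\n".join(cleaned)


if __name__ == "__main__":
    import sys

    # Example: lynx -dump -nolist page.html | python3 tidy_dump.py
    print(tidy_dump(sys.stdin.read()))
```

Turning column gaps into tabs rather than single spaces keeps the table structure recoverable, at the cost of breaking on cells that themselves contain double spaces. Note that stripping a leading asterisk would also remove a footnote marker at the start of a line.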
