How to get the text of a page using wget, without the HTML?

wget will only retrieve the document. If the document is in HTML, what you want is the result of parsing the document.

You could, for example, use lynx -dump -nolist, if you have lynx around.

lynx is a lightweight, simple text-mode web browser; its -dump option prints the rendered (parsed) page to standard output. -nolist suppresses the list of links that would otherwise appear at the end if the page contains any hyperlinks.

As mentioned by @Thor, elinks can be used for this too, as it also has a -dump option (and -no-references to omit the list of links). It can be especially useful if you run into a site that still uses -sigh- frames (MTFBWY).

Also, keep in mind that unless the page really is just C code with HTML tags, you will need to check the result, to make sure there's nothing in it besides the C code.
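One quick way to do that check is to look for leftover angle brackets. A hedged sketch (the filename and sample content here are made up, standing in for a page that stripped cleanly):

```shell
# Stand-in for a downloaded-and-stripped page ('stripped.txt' is hypothetical):
printf 'int main(void) { return 0; }\n' > stripped.txt

# Any remaining angle brackets deserve a manual look -- they could be
# unstripped tags, or legitimate C (#include <stdio.h>, a < b, etc.).
if grep -q '[<>]' stripped.txt; then
    echo "angle brackets found: review by hand"
else
    echo "no angle brackets left"
fi
```

Note this is only a heuristic: real C source legitimately contains < and >, so a hit means "look", not "broken".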


If you don't have any of these other tools installed, only wget, and the page has no formatting, just plain text and links (e.g. source code or a list of files), you can strip the HTML tags using sed like this:

wget -qO- http://address/of/page/you/want/to/view/ | sed -e 's/<[^>]*>//g'

This uses wget to dump the source of the page to standard output and sed to strip any < > pairs and everything between them.
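To see what the sed expression does on its own, you can run it on a sample line (the HTML here is made up, so no network is needed):

```shell
# Remove anything between '<' and '>' (inclusive) from the sample line:
printf '<p>hello <b>world</b></p>\n' | sed -e 's/<[^>]*>//g'
# prints: hello world
```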

You can then redirect the output of the sed command to the file you want to create using >:

wget -qO- http://.../ | sed -e 's/<[^>]*>//g' > downloaded_file.txt

NB: You may find that the file contains extra whitespace that you don't want (e.g. lines indented by a few columns).

It may be easiest to tidy the file up in your text editor (or with a source formatter, since you're downloading C source code).

If you need to do the same simple thing to every line of the file, you can add another expression to the sed command (here, stripping one leading space):

wget -qO- http://.../ | sed -e 's/<[^>]*>//g;s/^ //g' > downloaded_stripped_file.txt
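Run on a made-up sample line, the combined expression first strips the tags, then removes one leading space:

```shell
# Strip tags, then one leading space per line:
printf '<pre> int x = 1;</pre>\n' | sed -e 's/<[^>]*>//g;s/^ //g'
# prints: int x = 1;
```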

Just to add another tool: I prefer w3m, which is a lynx-like console browser. You may also want to check what's already available on your system.

w3m -dump website.html

Tags:

Linux

Wget