wget - How to download recursively and only specific mime-types/extensions (i.e. text only)

You can specify a list of allowed or disallowed filename patterns:

Allowed:

-A LIST
--accept LIST

Disallowed:

-R LIST
--reject LIST

LIST is a comma-separated list of filename patterns/extensions.

You can use the following wildcard characters to specify patterns:

  • *
  • ?
  • [
  • ]
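To get a feel for how these wildcards behave, here is a small sketch that uses the shell's own `case` matching as a stand-in. It is only an approximation of what wget does internally, and `matches` is a hypothetical helper name, not part of wget.

```shell
# matches PATTERN NAME
# Succeeds if NAME matches the shell wildcard PATTERN.
# Assumption: wget's pattern matching behaves like shell globbing here.
matches() {
  case "$2" in
    $1) return 0 ;;   # unquoted on purpose so *, ? and [ ] stay active
    *)  return 1 ;;
  esac
}

# Usage:
#   matches 'avatar*.png' 'avatar_01.png' && echo "would match"
#   matches 'page?.html'  'page10.html'   || echo "no match, ? is one character"
```

Keep the patterns in single quotes when you try this, otherwise the shell expands them against the files in the current directory first.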

Examples:

  • only download PNG files: -A png
  • don't download CSS files: -R css
  • don't download PNG files that start with "avatar": -R 'avatar*.png' (quoted so that the shell does not expand the pattern itself)
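Put together with recursion, a complete call could look like the sketch below. The function name `fetch_accepted`, the `WGET` override and the example.com URL are my own placeholders; `--recursive`, `--no-parent` and `--accept` are regular wget options.

```shell
# fetch_accepted URL LIST
# Recursively download from URL, keeping only files that match LIST.
# WGET defaults to wget; set WGET=echo to preview the command without downloading.
fetch_accepted() {
  url=$1      # start URL
  accept=$2   # comma-separated list passed to --accept
  "${WGET:-wget}" --recursive --no-parent --accept "$accept" "$url"
}

# Usage (example.com is a placeholder):
#   fetch_accepted 'https://example.com/docs/' 'txt,md'
```

As far as I know, wget still fetches HTML pages during the crawl so it can follow their links, and deletes them afterwards if they do not match the accept list.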

If the file has no extension, or its name follows no pattern you could match on, you would need MIME type parsing, I guess (see Lars Kotthoff's answer).
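If you just want to check what MIME type a server reports for a single URL, a rough sketch is to read the Content-Type response header. `mime_from_headers` is a hypothetical helper, and this is a manual check only, not a way to make classic wget filter by MIME type during a recursive download.

```shell
# mime_from_headers
# Reads HTTP response headers on stdin and prints the bare MIME type
# from the last Content-Type header (parameters like charset are dropped).
mime_from_headers() {
  tr -d '\r' | awk -F': *' '
    { sub(/^[ \t]+/, "") }                  # wget indents the headers it prints
    tolower($1) == "content-type" { split($2, parts, ";"); type = parts[1] }
    END { if (type != "") print type }
  '
}

# Usage (example.com is a placeholder; wget writes the headers to stderr):
#   wget --spider --server-response 'https://example.com/file' 2>&1 | mime_from_headers
```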


You could try patching wget with this (also here) to filter by MIME type. This patch is quite old now though, so it might not work anymore.


A newer Wget (Wget2) already has this feature:

--filter-mime-type    Specify a list of mime types to be saved or ignored

### `--filter-mime-type=list`

Specify a comma-separated list of MIME types that will be downloaded.  Elements of list may contain wildcards.
If a MIME type starts with the character '!' it won't be downloaded; this is useful when trying to download
something with exceptions. For example, download everything except images:

  wget2 -r https://<site>/<document> --filter-mime-type=*,\!image/*

It is also useful to download files that are compatible with an application of your system. For instance,
download every file that is compatible with LibreOffice Writer from a website using the recursive mode:

  wget2 -r https://<site>/<document> --filter-mime-type=$(sed -r '/^MimeType=/!d;s/^MimeType=//;s/;/,/g' /usr/share/applications/libreoffice-writer.desktop)
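For the "text only" case from the question, the same option can be given a single wildcard type. This is a sketch based on the option description above; the function name `fetch_text_only`, the `WGET2` override and the example.com URL are placeholders of mine.

```shell
# fetch_text_only URL
# Recursively download from URL, keeping only responses whose MIME type is text/*.
# WGET2 defaults to wget2; set WGET2=echo to preview the command without downloading.
fetch_text_only() {
  "${WGET2:-wget2}" --recursive --filter-mime-type='text/*' "$1"
}

# Usage (example.com is a placeholder):
#   fetch_text_only 'https://example.com/'
```

Since HTML pages are served as text/html, they pass this filter too, so the crawl can still follow links.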

Wget2 has not been released as of today, but will be soon. Debian unstable already has an alpha version shipped.

Look at https://gitlab.com/gnuwget/wget2 for more info. You can post questions/comments directly to [email protected].