wget with wildcards in HTTP downloads

I think these switches will do what you want with wget:

   -A acclist --accept acclist
   -R rejlist --reject rejlist
       Specify comma-separated lists of file name suffixes or patterns to 
       accept or reject. Note that if any of the wildcard characters, *, ?,
       [ or ], appear in an element of acclist or rejlist, it will be 
       treated as a pattern, rather than a suffix.

   --accept-regex urlregex
   --reject-regex urlregex
       Specify a regular expression to accept or reject the complete URL.

Example

$ wget -r --no-parent -A 'bar.*.tar.gz' http://url/dir/
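The `-A` pattern uses shell-style globbing, not regex. As a rough local illustration of that matching (this is not wget itself, and the filenames are made up), the same semantics can be reproduced with a bash `case` statement:

```shell
#!/usr/bin/env bash
# Local illustration of the shell-style globbing that -A/-R patterns
# use; these filenames are hypothetical.
for f in bar.1.2.tar.gz bar.tar.gz baz.1.tar.gz; do
  case "$f" in
    bar.*.tar.gz) echo "accept $f" ;;  # same glob as the -A pattern above
    *)            echo "reject $f" ;;
  esac
done
```

Here `bar.1.2.tar.gz` is accepted, while `bar.tar.gz` and `baz.1.tar.gz` are rejected, because `*` must match the version component between the two literal dots.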

There's a good reason this can't work directly with HTTP: a URL is not a file path, even though the use of / as a delimiter can make it look like one, and the two do sometimes correspond.1

Conventionally (or, historically), web servers often do mirror directory hierarchies (for some -- e.g., Apache -- this is sort of integral) and even provide directory indexes much like a filesystem. However, nothing about the HTTP protocol requires this.

This is significant, because if you want to apply a glob to, say, everything that is a subpath of http://foo/bar/, then unless the server offers some mechanism for listing those paths (e.g. the aforementioned index), there is nothing to apply the glob to. There is no filesystem there to search. For example, just because you know the pages http://foo/bar/one.html and http://foo/bar/two.html exist does not mean you can get a list of files and subdirectories via http://foo/bar/. It would be completely within protocol for the server to return 404 for that. Or it could return a list of files. Or it could send you a nice jpg picture. Etc.

So there is no standard here that wget can exploit. AFAICT, wget mirrors a path hierarchy by actively examining the links in each page. In other words, if you recursively mirror http://foo/bar/index.html, it downloads index.html and then extracts the links that are a subpath of that.2 The -A switch is simply a filter applied during this process.
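That extract-and-filter step can be sketched crudely in shell (the HTML snippet here is hypothetical, and wget's real HTML parser and URL handling are far more thorough than a grep):

```shell
#!/usr/bin/env bash
# Hypothetical index page; in reality wget fetches this over HTTP.
html='<a href="bar.1.0.tar.gz">one</a> <a href="bar.2.0.tar.gz">two</a> <a href="notes.txt">notes</a>'

# Pull out the href targets, then keep only those matching the -A glob.
printf '%s\n' "$html" |
grep -o 'href="[^"]*"' | sed 's/^href="//; s/"$//' |
while read -r link; do
  case "$link" in
    bar.*.tar.gz) echo "would fetch $link" ;;
  esac
done
```

Only the two tarball links survive the filter; notes.txt is extracted from the page but discarded, which is exactly the role -A plays during recursion.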

In short, if you know these files are indexed somewhere, you can start from that index using -A. If not, then you are out of luck.


1. Of course an FTP URL is a URL too. However, while I don't know much about the FTP protocol, I'd guess, based on its nature, that it may be of a form which allows for transparent globbing.

2. This means that there could be a valid URL http://foo/bar/alt/whatever/stuff/ that will not be included, because nothing in the set of pages reachable from http://foo/bar/index.html links to it. Unlike filesystems, web servers are not obliged to make the layout of their content transparent, nor do they need to do it in an intuitively obvious way.
