Using wget, what is the right command to get the gzipped version instead of the actual HTML?
If you request gzipped content (using the accept-encoding: gzip header, which is correct), my understanding is that wget does not decompress the response, so it can't read the content to find links to follow. You end up with a single gzipped file on disk for the first page you hit, but no other content.
In other words, you can't use wget to request gzipped content and recurse the entire site at the same time.
I think there's a patch that allows wget to support this, but it's not in the default distribution.
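For the single-page case, a minimal sketch of the command might look like the following. The helper names `fetch_gzipped` and `is_gzip_file` and the output filename are placeholders of mine, not anything wget provides; `-O` writes the response body to the named file, and `gzip -t` checks whether what came back really is gzip data, since the server is free to ignore the header.

```shell
# Hypothetical helper: fetch one URL while asking for gzip, and save the
# body to OUTFILE exactly as the server sent it.
# Usage: fetch_gzipped URL OUTFILE
fetch_gzipped() {
    wget --header="accept-encoding: gzip" -O "$2" "$1"
}

# Hypothetical helper: succeeds only if FILE is valid gzip data.
# The server may ignore the header and send plain HTML instead.
is_gzip_file() {
    gzip -t "$1" 2>/dev/null
}

# Example (not run here):
#   fetch_gzipped http://wordpress.com/ index.html.gz
#   is_gzip_file index.html.gz && gunzip -c index.html.gz > index.html
```

This only covers one URL at a time; it does not get around the recursion problem described above.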
If you include the -S flag, wget prints the server's response headers, so you can tell whether the web server is responding with the correct type of content. For example:
    wget -S --header="accept-encoding: gzip" wordpress.com
    --2011-06-17 16:06:46--  http://wordpress.com/
    Resolving wordpress.com (wordpress.com)... 220.127.116.11, 18.104.22.168, 22.214.171.124
    Connecting to wordpress.com (wordpress.com)|126.96.36.199|:80... connected.
    HTTP request sent, awaiting response...
      HTTP/1.1 200 OK
      Server: nginx
      Date: Fri, 17 Jun 2011 15:06:47 GMT
      Content-Type: text/html; charset=UTF-8
      Connection: close
      Vary: Accept-Encoding
      Last-Modified: Fri, 17 Jun 2011 15:04:57 +0000
      Cache-Control: max-age=190, must-revalidate
      Vary: Cookie
      X-hacker: If you're reading this, you should visit automattic.com/jobs and apply to join the fun, mention this header.
      X-Pingback: http://wordpress.com/xmlrpc.php
      Link: <http://wp.me/1>; rel=shortlink
      X-nananana: Batcache
      Content-Encoding: gzip
    Length: unspecified [text/html]
The Content-Encoding header clearly states gzip. However, for linux.about.com (currently):
    wget -S --header="accept-encoding: gzip" linux.about.com
    --2011-06-17 16:12:55--  http://linux.about.com/
    Resolving linux.about.com (linux.about.com)... 188.8.131.52
    Connecting to linux.about.com (linux.about.com)|184.108.40.206|:80... connected.
    HTTP request sent, awaiting response...
      HTTP/1.1 200 OK
      Date: Fri, 17 Jun 2011 15:12:56 GMT
      Server: Apache
      Set-Cookie: TMog=B6HFCs2H20kA1I4N; domain=.about.com; path=/; expires=Sat, 22-Sep-12 14:19:35 GMT
      Set-Cookie: Mint=B6HFCs2H20kA1I4N; domain=.about.com; path=/
      Set-Cookie: zBT=1; domain=.about.com; path=/
      Vary: *
      PRAGMA: no-cache
      P3P: CP="IDC DSP COR DEVa TAIa OUR BUS UNI"
      Cache-Control: max-age=-3600
      Expires: Fri, 17 Jun 2011 14:12:56 GMT
      Connection: close
      Content-Type: text/html
    Length: unspecified [text/html]
Here there is no Content-Encoding header at all: the server is returning plain, uncompressed text/html.
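The check done by eye on the two responses above can also be scripted. This is only a sketch: `has_gzip_encoding` is a name I made up, and it simply greps the header dump for a Content-Encoding line that mentions gzip. wget writes the `-S` output to stderr, which is why the example redirects it with `2>&1`.

```shell
# Hypothetical helper: reads response headers on stdin (for example the
# output of `wget -S`, which indents each header line) and succeeds only
# if a Content-Encoding header mentions gzip.
has_gzip_encoding() {
    grep -qiE '^[[:space:]]*content-encoding:.*gzip'
}

# Example (not run here):
#   wget -S --header="accept-encoding: gzip" -O /dev/null wordpress.com 2>&1 \
#       | has_gzip_encoding && echo "server sent gzip"
```

Matching on the start of the line keeps a header such as `Vary: Accept-Encoding` from being mistaken for a gzip response.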
Because some older browsers still have issues with gzip-encoded content, many sites only enable it based on browser identification. They often turn it off by default and only turn it on when they know the browser can support it, and they usually don't include wget in that list. This means you may find wget never returns gzip content even if the site appears to do so for your browser.
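If a site does switch gzip on by browser identification, one thing worth trying is to have wget send a browser-like User-Agent with its `--user-agent` option. The sketch below assumes an arbitrary Firefox-style string and a made-up helper name, `headers_as_browser`; whether a given site then responds with gzip depends entirely on that site.

```shell
# Assumed example User-Agent string; any browser-like value could be used.
BROWSER_UA="Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0"

# Hypothetical helper: print the response headers for URL while identifying
# as a browser and asking for gzip. The body is discarded via -O /dev/null.
# Usage: headers_as_browser URL
headers_as_browser() {
    wget -S --user-agent="$BROWSER_UA" \
         --header="accept-encoding: gzip" \
         -O /dev/null "$1" 2>&1
}

# Example (not run here):
#   headers_as_browser linux.about.com | grep -i 'content-encoding'
```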