How to determine if a string was compressed?

PRE:
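I guess, if you send a request, you can immediately look into $http_response_header to see whether one of the items in the array is a variation of Content-Encoding: gzip. But this is lame, and it tells you nothing once the string has been separated from its headers.
There is a far better method: look at the bytes themselves.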


Here is HOW TO...

Check if it's GZIP. Like a BOSS!

According to the GZIP RFC (RFC 1952), the header of gzip content looks like this:

+---+---+---+---+---+---+---+---+---+---+
|ID1|ID2|CM |FLG|     MTIME     |XFL|OS | (more-->)
+---+---+---+---+---+---+---+---+---+---+

ID1 and ID2 identify the content as GZIP. CM is the compression method; the value 8 means DEFLATE, which is what GZIP customarily uses on all web servers.

oh! and they have fixed values:

  • The value of ID1 is "\x1f"
  • The value of ID2 is "\x8b"
  • The value of CM is "\x08" (or just 8...)

almost there:

// Compare the first three raw bytes; this is binary data, so use a byte-oriented function.
$is_gzip = 0 === strncmp($mystery_string, "\x1f\x8b\x08", 3);
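
If you want to look at the rest of the fixed header as well, here is a minimal sketch that unpacks the ten bytes from the diagram above. The function name parse_gzip_header and the array keys are made up for illustration, and it assumes the zlib extension is available for the gzencode() call in the usage lines.

```php
<?php
// Hypothetical helper: parse the fixed 10-byte gzip header shown above.
// Returns the fields as an array, or null if the bytes are not gzip/DEFLATE.
function parse_gzip_header(string $bytes): ?array
{
    if (strlen($bytes) < 10) {
        return null; // too short to hold a complete header
    }
    // C = unsigned byte, V = unsigned 32-bit little-endian (MTIME)
    $h = unpack('Cid1/Cid2/Ccm/Cflg/Vmtime/Cxfl/Cos', $bytes);
    if ($h['id1'] !== 0x1f || $h['id2'] !== 0x8b || $h['cm'] !== 8) {
        return null; // magic bytes or compression method do not match
    }
    return $h;
}

// Usage: a freshly gzipped string parses, plain text does not.
var_dump(parse_gzip_header(gzencode('hello world')) !== null);
var_dump(parse_gzip_header('hello world') === null);
```

Note that a match only says the data starts like a gzip stream; it does not prove the rest of the stream is intact.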

Working example

<?php
/** @link https://gist.github.com/eladkarako/d8f3addf4e3be92bae96#file-checking_gzip_like_a_boss-php */

date_default_timezone_set("Asia/Jerusalem");

while (ob_get_level() > 0) ob_end_flush();
mb_language("uni");
@mb_internal_encoding('UTF-8');
setlocale(LC_ALL, 'en_US.UTF-8');

header('Time-Zone: Asia/Jerusalem');
// The charset is declared on Content-Type below; Content-Encoding is reserved for compression codings such as gzip.
header('Content-Type: text/plain; charset=UTF-8');
header('Access-Control-Allow-Origin: *');

function get($url, $cookie = '') {
  $html = @file_get_contents($url, false, stream_context_create([
    'http' => [
      'method' => "GET",
      'header' => implode("\r\n", [''
        , 'Pragma: no-cache'
        , 'Cache-Control: no-cache'
        , 'User-Agent: Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2310.0 Safari/537.36'
        , 'DNT: 1'
        , 'Accept-Language: en-US,en;q=0.8'
        , 'Accept: text/plain'
        , 'X-Forwarded-For: ' . implode(', ', array_unique(array_filter(array_map(function ($item) { return filter_input(INPUT_SERVER, $item, FILTER_SANITIZE_SPECIAL_CHARS); }, ['HTTP_X_FORWARDED_FOR', 'REMOTE_ADDR', 'HTTP_CLIENT_IP', 'SERVER_ADDR', 'REMOTE_ADDR']), function ($item) { return null !== $item; })))
        , 'Referer: http://eladkarako.com'
        , 'Connection: close'
        , 'Cookie: ' . $cookie
        , 'Accept-Encoding: gzip'
      ])
    ]]));

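  // Compare raw bytes: ID1, ID2 and CM. file_get_contents() returns false on failure.
  $is_gzip = is_string($html) && 0 === strncmp($html, "\x1f\x8b\x08", 3);

  // gzdecode() understands the gzip wrapper; the second argument of zlib_decode() is a maximum output length, not an encoding constant.
  return $is_gzip ? gzdecode($html) : $html;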
}

$html = get('http://www.pogdesign.co.uk/cat/');

echo $html;

What do we see here that is worth mentioning?

  • Start by initializing the PHP engine to use UTF-8 (since we don't really know whether the web server will return GZIP content).
  • Providing the header Accept-Encoding: gzip tells the web server that it may respond with GZIP content.
  • Discovering GZIP content is a comparison of the first three raw bytes, so it must be done byte-wise (a byte-oriented function such as strncmp, or a multi-byte function forced to a single-byte encoding), never in a multi-byte encoding such as UTF-8.
  • Finally, returning the plain output is easy with the zlib functions.

A string and a compressed string are both simply sequences of bytes. You cannot really distinguish one sequence of bytes from another sequence of bytes. You should know whether a blob of bytes represents a compressed format or not from accompanying metadata.
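
For an HTTP response, the accompanying metadata is the Content-Encoding header. A minimal sketch of checking it, assuming the headers arrive as a list of raw lines in the shape of $http_response_header; the function name headers_declare_gzip is made up for illustration:

```php
<?php
// Hypothetical helper: report whether a list of raw response header lines
// declares a gzip content coding.
function headers_declare_gzip(array $headers): bool
{
    foreach ($headers as $line) {
        // Header names are case-insensitive; the value may list several codings.
        if (preg_match('/^Content-Encoding:\s*(.+)$/i', $line, $m)) {
            $codings = array_map('trim', explode(',', strtolower($m[1])));
            if (in_array('gzip', $codings, true) || in_array('x-gzip', $codings, true)) {
                return true;
            }
        }
    }
    return false;
}

// Usage with a hand-written header list:
var_dump(headers_declare_gzip(['HTTP/1.1 200 OK', 'content-encoding: GZIP']));
```

After redirects the list can hold the headers of several responses, so this sketch may need to look only at the last block in that case.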

If you really need to guess programmatically, you have several things you can try:

  • Try to uncompress the string and see if the uncompress operation succeeds. If it fails, the bytes probably did not represent a compressed string.
  • Try to check for obvious "weird" bytes, such as anything below 0x20 other than tab, line feed, and carriage return. Those bytes aren't typically used in regular text. There's no real guarantee that they occur in a compressed string though.
  • Use mb_check_encoding to see whether a string is valid in the encoding you suspect it to be in. If it isn't, it's probably compressed (or you checked for the wrong encoding). The caveat is that virtually any byte sequence is valid in virtually every single-byte encoding, so this will only work for multi-byte encodings.
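
The three guesses above can be combined into one rough check. This is only a sketch under assumptions: the name looks_compressed is made up, it only attempts gzip (via gzdecode), it needs the zlib and mbstring extensions, and it will also answer true for any other kind of binary data.

```php
<?php
// Hypothetical helper: guess whether $bytes is compressed rather than plain text.
function looks_compressed(string $bytes, string $expected_encoding = 'UTF-8'): bool
{
    // 1. Try to uncompress; @ hides the warning gzdecode() raises on failure.
    if (@gzdecode($bytes) !== false) {
        return true;
    }
    // 2. Control bytes other than tab, LF and CR are rare in regular text.
    if (preg_match('/[\x00-\x08\x0B\x0C\x0E-\x1F]/', $bytes) === 1) {
        return true;
    }
    // 3. Bytes that are invalid in the expected multi-byte encoding are suspicious.
    return !mb_check_encoding($bytes, $expected_encoding);
}

// Usage:
var_dump(looks_compressed(gzencode('hello')));
var_dump(looks_compressed('hello world'));
```

Treat the result as a hint, not as proof; metadata remains the reliable source.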