Write a URL compressor

CJam, 268306 bytes (previously 308954)

Compressor (511 bytes)

0000000: 22 2f 25 2d 30 31 32 33 35 61 65 0a 2e 34 37 38  "/%-01235ae..478
0000010: 39 62 63 66 67 69 6c 6d 70 73 74 75 0a 36 41 42  9bcfgilmpstu.6AB
0000020: 43 44 45 46 49 4a 4b 4c 4d 53 58 5f 64 68 6a 6b  CDEFIJKLMSX_dhjk
0000030: 6e 6f 72 76 77 78 79 7a 97 a0 a5 a6 22 0a 4e 2f  norvwxyz....".N/
0000040: 5f 3a 2b 32 35 36 63 2c 5e 61 2b 3a 47 3b 22 63  _:+256c,^a+:G;"c
0000050: 6f 7f 6d 73 74 6e 67 6f 6c 6f 72 64 65 7f 85 83  o.mstngolorde...
0000060: 66 67 87 86 88 81 61 8a 63 8b 6b 61 82 80 2f 2e  fg....a.c.ka../.
0000070: 8e 65 8f 8d 90 68 91 63 92 78 93 65 94 8c 95 65  .e...h.c.x.e...e
0000080: 72 6f 6e 84 67 77 69 9a 6b 9b 69 74 68 96 61 69  ron.gwi.k.ith.ai
0000090: 98 69 6e 65 74 6e a1 65 6e 64 69 25 32 a5 30 a6  .inetn.endi%2.0.
00000a0: a6 77 77 a8 77 6c 65 72 65 2e 99 ac 2f ad 9c 9f  .ww.wlere.../...
00000b0: 73 81 af 89 2e b1 9e b1 96 65 a4 70 b4 9c b5 72  s........e.p...r
00000c0: 69 a5 42 b7 70 b9 74 62 8f b6 61 bc ae 61 72 b2  i.B.p.tb..a..ar.
00000d0: 2f 61 74 6c 6f 61 6d 74 6d 68 c3 c4 6c 65 b0 75  /atloamtmh..le.u
00000e0: c6 71 c7 b3 c8 63 ba 73 ca 88 cb 72 6f 61 6e 69  .q...c.s...roani
00000f0: 9d cf 75 d0 bb 67 d1 65 73 75 73 d4 97 b3 d5 6a  ..u..g.esus....j
0000100: c2 63 d7 c1 77 d9 8f 70 68 db bb 70 dc 73 75 74  .c..w..ph..p.sut
0000110: de a4 df 61 e0 98 90 85 e2 69 e3 d3 83 e5 8d a9  ...a.....i......
0000120: 2e 61 dd 76 97 6f e9 8c ea eb 66 ec da 6a 73 ee  .a.v.o....f..js.
0000130: 66 ef 69 f0 64 f1 64 f2 aa 70 cd a2 2f 2e f5 f3  f.i.d.d..p../...
0000140: f6 e1 f6 e6 73 f9 ae 61 6c 75 6e bf 34 70 79 fe  ....s..alun.4py.
0000150: 9d 63 74 fc 00 b8 b8 75 72 03 8f a5 43 64 6f 06  .ct....ur...Cdo.
0000160: 63 d6 73 ed c8 a7 a7 b8 a6 a6 a5 69 73 72 c2 75  c.s........isr.u
0000170: 74 69 82 bd 2f 2e 11 a3 12 67 49 14 43 15 41 16  ti../....gI.C.A.
0000180: 16 a3 74 07 73 72 fc d2 0d 69 74 2e c5 63 68 61  ..t.sr...it..cha
0000190: 73 e8 3f 20 63 5c 22 3d be 1e 6f 03 69 aa a5 39  s.? c\"=..o.i..9
00001a0: 60 a5 6f 66 22 32 2f 31 32 39 2c 31 32 37 66 2b  `.of"2/129,127f+
00001b0: 33 33 2c 2b 3a 63 22 5c 22 3c 3e 5c 5e 60 7b 7d  33,+:c"\"<>\^`{}
00001c0: 22 2b 6c 5f 34 3d 27 73 3d 3a 4c 37 2b 3e 40 7b  "+l_4='s=:L7+>@{
00001d0: 2f 5c 28 40 5c 2a 7d 2f 30 5c 7b 47 7b 31 24 23  /\(@\*}/0\{G{1$#
00001e0: 29 7d 23 5f 47 3d 3a 54 40 23 40 54 2c 2a 2b 34  )}#_G=:T@#@T,*+4
00001f0: 2a 2b 7d 2f 32 2a 4c 2b 32 35 36 62 3a 63 0a     *+}/2*L+256b:c.

Decompressor (510 bytes)

0000000: 22 2f 25 2d 30 31 32 33 35 61 65 0a 2e 34 37 38  "/%-01235ae..478
0000010: 39 62 63 66 67 69 6c 6d 70 73 74 75 0a 36 41 42  9bcfgilmpstu.6AB
0000020: 43 44 45 46 49 4a 4b 4c 4d 53 58 5f 64 68 6a 6b  CDEFIJKLMSX_dhjk
0000030: 6e 6f 72 76 77 78 79 7a 97 a0 a5 a6 22 0a 4e 2f  norvwxyz....".N/
0000040: 5f 3a 2b 32 35 36 63 2c 5e 61 2b 3a 47 3b 22 63  _:+256c,^a+:G;"c
0000050: 6f 7f 6d 73 74 6e 67 6f 6c 6f 72 64 65 7f 85 83  o.mstngolorde...
0000060: 66 67 87 86 88 81 61 8a 63 8b 6b 61 82 80 2f 2e  fg....a.c.ka../.
0000070: 8e 65 8f 8d 90 68 91 63 92 78 93 65 94 8c 95 65  .e...h.c.x.e...e
0000080: 72 6f 6e 84 67 77 69 9a 6b 9b 69 74 68 96 61 69  ron.gwi.k.ith.ai
0000090: 98 69 6e 65 74 6e a1 65 6e 64 69 25 32 a5 30 a6  .inetn.endi%2.0.
00000a0: a6 77 77 a8 77 6c 65 72 65 2e 99 ac 2f ad 9c 9f  .ww.wlere.../...
00000b0: 73 81 af 89 2e b1 9e b1 96 65 a4 70 b4 9c b5 72  s........e.p...r
00000c0: 69 a5 42 b7 70 b9 74 62 8f b6 61 bc ae 61 72 b2  i.B.p.tb..a..ar.
00000d0: 2f 61 74 6c 6f 61 6d 74 6d 68 c3 c4 6c 65 b0 75  /atloamtmh..le.u
00000e0: c6 71 c7 b3 c8 63 ba 73 ca 88 cb 72 6f 61 6e 69  .q...c.s...roani
00000f0: 9d cf 75 d0 bb 67 d1 65 73 75 73 d4 97 b3 d5 6a  ..u..g.esus....j
0000100: c2 63 d7 c1 77 d9 8f 70 68 db bb 70 dc 73 75 74  .c..w..ph..p.sut
0000110: de a4 df 61 e0 98 90 85 e2 69 e3 d3 83 e5 8d a9  ...a.....i......
0000120: 2e 61 dd 76 97 6f e9 8c ea eb 66 ec da 6a 73 ee  .a.v.o....f..js.
0000130: 66 ef 69 f0 64 f1 64 f2 aa 70 cd a2 2f 2e f5 f3  f.i.d.d..p../...
0000140: f6 e1 f6 e6 73 f9 ae 61 6c 75 6e bf 34 70 79 fe  ....s..alun.4py.
0000150: 9d 63 74 fc 00 b8 b8 75 72 03 8f a5 43 64 6f 06  .ct....ur...Cdo.
0000160: 63 d6 73 ed c8 a7 a7 b8 a6 a6 a5 69 73 72 c2 75  c.s........isr.u
0000170: 74 69 82 bd 2f 2e 11 a3 12 67 49 14 43 15 41 16  ti../....gI.C.A.
0000180: 16 a3 74 07 73 72 fc d2 0d 69 74 2e c5 63 68 61  ..t.sr...it..cha
0000190: 73 e8 3f 20 63 5c 22 3d be 1e 6f 03 69 aa a5 39  s.? c\"=..o.i..9
00001a0: 60 a5 6f 66 22 32 2f 5b 71 32 35 36 62 32 6d 64  `.of"2/[q256b2md
00001b0: 22 68 74 74 70 22 6f 27 73 2a 6f 22 3a 2f 2f 22  "http"o's*o"://"
00001c0: 6f 7b 34 6d 64 47 3d 5f 2c 40 5c 6d 64 40 3d 5c  o{4mdG=_,@\md@=\
00001d0: 7d 68 3b 5d 57 25 31 32 39 2c 31 32 37 66 2b 33  }h;]W%129,127f+3
00001e0: 33 2c 2b 3a 63 22 5c 22 3c 3e 5c 5e 60 7b 7d 22  3,+:c"\"<>\^`{}"
00001f0: 2b 57 25 7b 2f 5c 29 40 5c 2a 7d 2f 4e 0a        +W%{/\)@\*}/N.

Algorithm

  1. Strip the <scheme>:// part from the URL.

  2. Replace character pairs by unused code points in the 0 - 255 range.

    This uses a static dictionary which is included in the source code.

  3. Use arithmetic encoding on the modified input string.

    To comply with the source code size limit, this is done by splitting the 256 code points into 4 groups and pretending the groups and the code points in a fixed group have equal probabilities.

  4. Append a bit indicating the scheme to the resulting integer.

  5. Convert the integer into a string.
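
The cost of step 3 per symbol follows from the group sizes alone. The sketch below uses Python rather than CJam for readability. The sizes 10, 16 and 31 are counted from the three character groups listed in the source code, the fourth group is assumed to hold the remaining code points, and `bits_per_symbol` is a name made up for this illustration.

```python
import math

# Sizes of the four character groups. The first three are listed explicitly
# in the source code; the fourth holds all remaining code points.
GROUP_SIZES = [10, 16, 31, 256 - (10 + 16 + 31)]

def bits_per_symbol(group_size):
    """Cost of one symbol: pick 1 of 4 groups, then 1 of group_size members."""
    return math.log2(4 * group_size)

for index, size in enumerate(GROUP_SIZES):
    print(f"group {index}: {size:3d} symbols, {bits_per_symbol(size):.2f} bits each")
```

Characters in the small groups cost fewer bits than a plain byte, while anything in the last group costs more than 8 bits, so the scheme pays off only if the common characters sit in the first groups.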

Test cases

Create the source code files.

$ xxd -p -r > comp.cjam <<< 222f252d303132333561650a2e3437383962636667696c6d707374750a36414243444546494a4b4c4d53585f64686a6b6e6f72767778797a97a0a5a6220a4e2f5f3a2b323536632c5e612b3a473b22636f7f6d73746e676f6c6f7264657f8583666787868881618a638b6b6182802f2e8e658f8d9068916392789365948c9565726f6e846777699a6b9b69746896616998696e65746ea1656e64692532a530a6a67777a8776c6572652e99ac2fad9c9f7381af892eb19eb19665a470b49cb57269a542b770b974628fb661bcae6172b22f61746c6f616d746d68c3c46c65b075c671c7b3c863ba73ca88cb726f616e699dcf75d0bb67d165737573d497b3d56ac263d7c177d98f7068dbbb70dc737574dea4df61e0989085e269e3d383e58da92e61dd76976fe98ceaeb66ecda6a73ee66ef69f064f164f2aa70cda22f2ef5f3f6e1f6e673f9ae616c756ebf347079fe9d6374fc00b8b87572038fa543646f0663d673edc8a7a7b8a6a6a5697372c275746982bd2f2e11a3126749144315411616a374077372fcd20d69742ec563686173e83f20635c223dbe1e6f0369aaa53960a56f6622322f3132392c313237662b33332c2b3a63225c223c3e5c5e607b7d222b6c5f343d27733d3a4c372b3e407b2f5c28405c2a7d2f305c7b477b312423297d235f473d3a54402340542c2a2b342a2b7d2f322a4c2b323536623a630a

$ xxd -p -r > decomp.cjam <<< 222f252d303132333561650a2e3437383962636667696c6d707374750a36414243444546494a4b4c4d53585f64686a6b6e6f72767778797a97a0a5a6220a4e2f5f3a2b323536632c5e612b3a473b22636f7f6d73746e676f6c6f7264657f8583666787868881618a638b6b6182802f2e8e658f8d9068916392789365948c9565726f6e846777699a6b9b69746896616998696e65746ea1656e64692532a530a6a67777a8776c6572652e99ac2fad9c9f7381af892eb19eb19665a470b49cb57269a542b770b974628fb661bcae6172b22f61746c6f616d746d68c3c46c65b075c671c7b3c863ba73ca88cb726f616e699dcf75d0bb67d165737573d497b3d56ac263d7c177d98f7068dbbb70dc737574dea4df61e0989085e269e3d383e58da92e61dd76976fe98ceaeb66ecda6a73ee66ef69f064f164f2aa70cda22f2ef5f3f6e1f6e673f9ae616c756ebf347079fe9d6374fc00b8b87572038fa543646f0663d673edc8a7a7b8a6a6a5697372c275746982bd2f2e11a3126749144315411616a374077372fcd20d69742ec563686173e83f20635c223dbe1e6f0369aaa53960a56f6622322f5b7132353662326d642268747470226f27732a6f223a2f2f226f7b346d64473d5f2c405c6d64403d5c7d683b5d57253132392c313237662b33332c2b3a63225c223c3e5c5e607b7d222b57257b2f5c29405c2a7d2f4e0a

Verify their integrity.

$ cksum comp.cjam decomp.cjam
2293013588 511 comp.cjam
1577103568 510 decomp.cjam

Download the CJam interpreter.

$ wget -q http://downloads.sourceforge.net/project/cjam/cjam-0.6.4/cjam-0.6.4.jar

Create an alias for running the interpreter.

$ alias cjam='java -jar cjam-0.6.4.jar'

Set encoding to ISO-8859-1 to store each character as a single byte.

$ LANG=en_US

Prepare the test file.

$ wget -q https://gist.githubusercontent.com/orlp/fd3411259469ade4c65d/raw/bb6b088c18c444d28729abc870d6076ca594a6de/urls.txt

$ echo >> urls.txt

For each line in the test file, feed the URL to the compressor, append the output to urls-comp.bin and feed it back to the decompressor. Save the combined output of all decompressions in urls-vrfy.txt. This will take a few minutes.

$ >urls-comp.bin

$ for URL in $(<urls.txt); do
> echo $URL | cjam comp.cjam | tee -a urls-comp.bin | cjam decomp.cjam
> done > urls-vrfy.txt

Verify that all URLs were decoded correctly.

$ cksum urls.txt urls-vrfy.txt
3445245739 588562 urls.txt
3445245739 588562 urls-vrfy.txt

Compute the score.

$ wc -c urls-comp.bin
268306 urls-comp.bin

How it works

Common

Push the string containing the first three character groups.

"/%-01235ae
.4789bcfgilmpstu
6ABCDEFIJKLMSX_dhjknorvwxyz����"

Split at linefeeds, flatten a copy of the resulting array and compute the symmetric difference of the resulting string and the string of all code points.

N/_:+256c,^

Append the result to the array which now contains all four character groups. Save the result in the variable G and discard it from the stack.

a+:G;
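
For readers who do not speak CJam, the construction of G can be mirrored in Python. This is only a sketch: the four non-ASCII bytes at the end of the third group (0x97, 0xa0, 0xa5, 0xa6) are copied from the hex dump, and the names `EXPLICIT` and `GROUPS` are invented here.

```python
# The three explicit groups as raw bytes, separated by linefeeds as in the
# source code. The last four bytes are taken from the hex dump.
EXPLICIT = (b"/%-01235ae\n"
            b".4789bcfgilmpstu\n"
            b"6ABCDEFIJKLMSX_dhjknorvwxyz\x97\xa0\xa5\xa6")

groups = EXPLICIT.split(b"\n")                         # N/
used = set(b"".join(groups))                           # _:+
rest = bytes(c for c in range(256) if c not in used)   # 256c,^
GROUPS = groups + [rest]                               # a+:G;

print([len(g) for g in GROUPS])  # [10, 16, 31, 199]
```

Every code point ends up in exactly one group, so any byte of the substituted string can be encoded.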

Push the array (let's call it A) containing all character pairs to be substituted.

"comstngolorde��fg����a�c�ka��/.�e���h�c�x�e���eron�gwi�k�ith�ai�inetn�endi%2�0��ww�wlere.��/���s���.����e�p���ri�B�p�tb��a��ar�/atloamtmh��le�u�qdz�c�sʈ�roani��uлg�esusԗ��j�c��wُphۻp�sutޤ�a���i�Ӄit.�chas�? c\"=�oi��9`�of"
2/

Push the array [ 0 ... 128 ], add 127 to each element, append the array [ 0 ... 32 ], cast to Char and append the string "\"<>\^`{}". The result is the string (let's call it S) of all unused code points from 0 to 255.

129,127f+33,+:c"\"<>\^`{}"+
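
The following Python sketch rebuilds S and shows how entries of A chain together. Only the first three pairs of A are used, read from the hex dump; the name `substitute` is invented for the illustration.

```python
# S: code points 127..255, then 0..32, then eight characters that are not
# expected to appear in a URL.
S = bytes(range(127, 256)) + bytes(range(33)) + b'"<>\\^`{}'

# First three entries of A, copied from the hex dump (the real list is longer).
PAIRS = [b"co", b"\x7fm", b"st"]

def substitute(text, pairs=PAIRS):
    """Replace each pair, in order, with the next code point taken from S."""
    for code, pair in zip(S, pairs):
        text = text.replace(pair, bytes([code]))
    return text

print(len(S))                             # 170
print(substitute(b"stackoverflow.com"))   # b'\x81ackoverflow.\x80'
```

Because the second pair contains the code point assigned to the first one, "com" collapses into the single byte 0x80. Later pairs can build on earlier ones in the same way.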

Compressor

...        " Generate G. Push A and S.                                        ";
l          " Read one line from STDIN.                                        ";
_4='s=     " Check if its 5th character is an 's'.                            ";
:L7+       " Save the result in L and add it to 7.                            ";
>          " Discard that many characters from the input string.              ";
@{         " For each character pair in A:                                    ";
  /        "   Split the input string at its occurrences.                     ";
  \(@\     "   Shift one character C off S.                                   ";
  *        "   Join the split string, using C as separator.                   ";
}/         "                                                                  ";
0          " Push 0 (accumulator).                                            ";
\{         " For each character C in the input string:                        ";
  G{1$#)}# "   Retrieve the index of the group C belongs to.                  ";
  _G=:T    "   Store the group in T.                                          ";
  @#       "   Push the index of C in T.                                      ";
  @T,*+    "   Multiply the accumulator by the length of T and add the index. ";
  4*+      "   Multiply by 4 and add the group index.                         ";
}/         "                                                                  ";
2*L+       " Multiply the accumulator by 2 and add L.                         ";
256b:c     " Convert the accumulator (BigInteger) into a byte string.         ";
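
The compressor can be paraphrased in Python as below. This is a sketch under assumptions, not a port that has been checked against the CJam program: `compress`, `GROUPS` and the `pairs` parameter are names introduced here, the pair list is left empty by default instead of reproducing A, and the byte conversion for an accumulator of 0 is a guess at what `256b:c` does.

```python
GROUPS = [
    b"/%-01235ae",
    b".4789bcfgilmpstu",
    b"6ABCDEFIJKLMSX_dhjknorvwxyz\x97\xa0\xa5\xa6",
]
GROUPS.append(bytes(c for c in range(256) if c not in b"".join(GROUPS)))
S = bytes(range(127, 256)) + bytes(range(33)) + b'"<>\\^`{}'

def compress(url, pairs=()):
    is_https = 1 if url[4:5] == "s" else 0            # _4='s=:L
    text = url[7 + is_https:].encode("latin-1")       # 7+>
    for code, pair in zip(S, pairs):                  # pair substitution
        text = text.replace(pair, bytes([code]))
    acc = 0
    for c in text:
        gi = next(i for i, g in enumerate(GROUPS) if c in g)
        group = GROUPS[gi]
        # Index within the group first, then the group index in base 4.
        acc = (acc * len(group) + group.index(c)) * 4 + gi
    acc = acc * 2 + is_https                          # 2*L+
    return acc.to_bytes(max(1, (acc.bit_length() + 7) // 8), "big")

print(compress("http://a"), compress("https://a"))    # b'@' b'A'
```

The character a is symbol 8 of group 0, so the accumulator becomes 8 * 4 + 0 = 32, and the scheme bit turns it into 64 or 65.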

Decompressor

...        " Generate G. Push A.                                              ";
[q         " Start an array and read the whole input.                         ";
256b       " Convert the byte string into an integer.                         ";
2md        " Push quotient and residue of the division by 2.                  ";
"http"o    " Print 'http'.                                                    ";
's*o       " If the residue is 1, print 's'.                                  ";
"://"o     " Print '://'.                                                     ";
{          " While the integer is non-zero:                                   ";
  4md      "   Push quotient and residue of the division by 4.                ";
  G=_,     "   Push the corresponding character group and its length.         ";
  @\md     "   Divide the integer by the length. Push quotient and residue.   ";
  @=\      "   Retrieve the char corresponding to the residue from the group. ";
}h         "                                                                  ";
]W%        " End and reverse array.                                           ";
           " Since the elements are Chars, this yields a string.              ";
...        " Push S.                                                          ";
W%{        " For each character in S reversed:                                ";
  /        "     Split the string at its occurrences.                         ";
  \)@\     "     Pop one character pair P from A.                             ";
  *        "     Join the split string, using P as separator.                 ";
}/         "                                                                  ";
N          " Push a linefeed.                                                 ";
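
A matching Python sketch of the decompressor follows. As before, `decompress`, `GROUPS` and `pairs` are names made up for the illustration, the pair list defaults to empty, and the trailing linefeed that the CJam program prints is left out.

```python
GROUPS = [
    b"/%-01235ae",
    b".4789bcfgilmpstu",
    b"6ABCDEFIJKLMSX_dhjknorvwxyz\x97\xa0\xa5\xa6",
]
GROUPS.append(bytes(c for c in range(256) if c not in b"".join(GROUPS)))
S = bytes(range(127, 256)) + bytes(range(33)) + b'"<>\\^`{}'

def decompress(data, pairs=()):
    n, is_https = divmod(int.from_bytes(data, "big"), 2)   # 256b2md
    symbols = []
    while True:                      # do-while, like CJam's h
        n, gi = divmod(n, 4)         # 4md
        group = GROUPS[gi]
        n, index = divmod(n, len(group))
        symbols.append(group[index])
        if n == 0:
            break
    text = bytes(reversed(symbols))                        # ]W%
    # Undo the substitutions in reverse order of their application.
    for code, pair in reversed(list(zip(S, pairs))):
        text = text.replace(bytes([code]), pair)
    return "http" + "s" * is_https + "://" + text.decode("latin-1")

print(decompress(b"@"), decompress(b"A"))   # http://a https://a
```

The symbols come out last to first, which is why the list is reversed before the substitutions are undone.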

Remarks

Decompression will work correctly only for the http and https schemes. The reasons are twofold:

  • The http part is hardcoded.

  • The arithmetic encoding uses the fact that <http|https>:// cannot be followed by a third slash.

    Since the integer 0 encodes any arbitrary run of slashes (character 0 of group 0), we would have to store the URL's length or the number of leading slashes to support, e.g., file:/// URLs.
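
The ambiguity is easy to reproduce with a small Python sketch that folds symbols of the first character group only. The helper name `fold` is invented here.

```python
GROUP0 = b"/%-01235ae"   # first character group; '/' is symbol 0

def fold(text):
    """Accumulate symbols that all belong to group 0 (group index 0)."""
    acc = 0
    for c in text:
        acc = (acc * len(GROUP0) + GROUP0.index(c)) * 4 + 0
    return acc

print(fold(b"a"), fold(b"/a"), fold(b"///a"))   # 32 32 32
```

Leading slashes leave the accumulator at 0, so the three inputs are indistinguishable after encoding.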


BrainFuck: 550417

Encoder:

,,,,,[.,]

Decoder:

>++++++++++[-<++++++++++>]<++++.++++++++++++..----[.,]

It needs the URL without a trailing linefeed and expects 0 as EOF.

Example:

> echo -n "http://stackoverflow.com/" | beef encode.bf 
://stackoverflow.com/
> echo -n "://stackoverflow.com/" | beef decode.bf 
http://stackoverflow.com/
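
In Python terms, the two programs amount to the following. This is only a paraphrase for readers unfamiliar with BrainFuck, and the function names are made up.

```python
def bf_encode(url):
    # ",,,,," reads five characters and keeps only the fifth in the cell;
    # "[.,]" then echoes that character and everything after it.
    return url[4:]

def bf_decode(data):
    # The decoder prints "htt", leaves "p" in the cell, and "[.,]" prints
    # it before copying the input.
    return "http" + data

print(bf_encode("https://example.com/"))             # s://example.com/
print(bf_decode(bf_encode("https://example.com/")))  # https://example.com/
```

Since only the first four characters are dropped, an https URL keeps its s and still round-trips.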