Output the full name of titin

JavaScript (ES6), 16840 16825 16814 bytes

Saved a couple of bytes by:

  • inserting the last acid prefix in the packed string as suggested by ETHproductions
  • choosing the next character group in a slightly more efficient way when there's a tie

Assumes Windows-1252 character encoding. Outputs without a trailing newline.

Below is a simplified version without any actual data.

[...Array(256)].map((_,i)=>i<52|i>126&i<161|i==92|i==96?0:s=s.split(String.fromCharCode(i)).join(`...dictionary...`.slice(j,j+=2)),j=0,s=`...packed_data...`)&&[...s].map(c=>'methion|threon|glutamin|alan|prol|phen|leuc|ser|val|glutam|glyc|histid|isoleuc|tryptoph|argin|aspart|lys|asparagin|tyros|cystein'.split`|`[c.charCodeAt()-32]).join`yl`+'ine'
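For readers who find the one-liner hard to follow, here is an ungolfed sketch of the same two stages (dictionary expansion, then acid lookup). The names ACIDS, isDictionaryCode, unpack and toName are made up for this illustration, the toy dictionary and packed string are hypothetical stand-ins for the real data, and the early-exit guard is an addition that the golfed code does not have.

```javascript
// Acid stems, copied from the golfed code; index k corresponds to char code 32 + k.
const ACIDS = 'methion|threon|glutamin|alan|prol|phen|leuc|ser|val|glutam|glyc|histid|isoleuc|tryptoph|argin|aspart|lys|asparagin|tyros|cystein'.split('|');

// True for the char codes that the golfed condition does NOT skip.
function isDictionaryCode(i) {
  return !(i < 52 || (i > 126 && i < 161) || i === 92 || i === 96);
}

// Expand every dictionary code, in ascending order, into its 2-character entry.
function unpack(packed, dictionary) {
  let s = packed;
  let j = 0;
  for (let i = 0; i < 256; i++) {
    if (!isDictionaryCode(i)) continue;
    const pair = dictionary.slice(j, j + 2);
    j += 2;
    // Guard for the short toy dictionary only (an addition, not in the original):
    // the real dictionary has an entry for every code.
    if (pair.length < 2) break;
    s = s.split(String.fromCharCode(i)).join(pair);
  }
  return s;
}

// Map each remaining character to its acid stem, glue with "yl", end with "ine".
function toName(expanded) {
  return [...expanded].map(c => ACIDS[c.charCodeAt(0) - 32]).join('yl') + 'ine';
}

// Toy data: '4' (code 52) expands to "5 ", and '5' (code 53) expands to " !".
const toyDictionary = '5 ' + ' !';
console.log(toName(unpack('4!', toyDictionary)));
// → methionylthreonylmethionylthreonine
```

Because codes are expanded in ascending order, an entry may itself contain a higher code that is expanded on a later pass, as '4' does here.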

And here is the full version:

Try it online!

The TIO link includes some additional code to print the MD5 of the output rather than printing the output itself:

ec73f8229f7f8f2e542c316a41cbfff4

How?

This was compressed by:

  1. Mapping the 20 distinct amino acids to ASCII characters 32 to 51.
  2. Repeatedly replacing the most frequent 2-character group with a new character whose code lies in one of the following ranges:
    • 52-91
    • 93-95
    • 97-126
    • 161-255

This leaves us with a dictionary of 336 bytes (168 entries of 2 characters each) and a final compressed string of 16161 bytes.

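The code above does the exact opposite: it expands each dictionary character back into its 2-character group, then maps the resulting characters back to amino acid names.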
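The packing step can be sketched roughly as follows. This is not the actual packer used for this answer: the function names are hypothetical, ties are broken by first occurrence, pair counts include overlapping occurrences, and codes are handed out in ascending order with decompression simply undoing them in reverse. The real packer has to order its dictionary to suit the golfed decoder.

```javascript
// Char codes available for pairs: 52-91, 93-95, 97-126 and 161-255.
function codePool() {
  const pool = [];
  for (let i = 52; i < 256; i++) {
    if ((i > 126 && i < 161) || i === 92 || i === 96) continue;
    pool.push(i);
  }
  return pool;
}

// Most frequent 2-character group, or null if no group occurs at least twice.
// Counting is naive: overlapping occurrences are counted too.
function mostFrequentPair(s) {
  const counts = new Map();
  for (let i = 0; i + 1 < s.length; i++) {
    const pair = s.slice(i, i + 2);
    counts.set(pair, (counts.get(pair) || 0) + 1);
  }
  let best = null;
  let bestCount = 1;
  for (const [pair, n] of counts) {
    if (n > bestCount) { best = pair; bestCount = n; }
  }
  return best;
}

// Replace the most frequent pair with a fresh code until codes or pairs run out.
function compress(input, pool = codePool()) {
  let s = input;
  const entries = [];
  for (const code of pool) {
    const pair = mostFrequentPair(s);
    if (pair === null) break;
    s = s.split(pair).join(String.fromCharCode(code));
    entries.push([code, pair]);
  }
  return { packed: s, entries };
}

// Undo the replacements, most recent first.
function decompress(packed, entries) {
  let s = packed;
  for (let k = entries.length - 1; k >= 0; k--) {
    const [code, pair] = entries[k];
    s = s.split(String.fromCharCode(code)).join(pair);
  }
  return s;
}

// Toy input using only characters 32 to 51, as in step 1.
const sample = ' !'.repeat(8) + '"#'.repeat(4);
const result = compress(sample);
console.log(codePool().length);                                     // 168
console.log(decompress(result.packed, result.entries) === sample);  // true
```

The pool size of 168 matches the 168 dictionary entries quoted above: 40 + 3 + 30 + 95 codes across the four ranges.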


Bash + sed + xz, 13322 13290 12785 12706 12694 12687 12681 bytes

sed 1d $0|unxz|sed /`sed 's:[A-T]:yl/g;s/&/:g'<<<AvalBglutamClysDthreonEserFprolGglycHleucIisoleucJalanKaspartLarginMasparaginNtyrosOglutaminPphenylalanQtryptophRhistidScysteinTmethion`yl/g\;s/yl$/ine/
<12479 bytes of binary data>

Try it online!


Mathematica, 319 + 14971 = 15290 bytes

For[p={};a={t={{{{{tyros,asparagin},prol},{ser,threon}},{{lys,alan},{glutam,val}}},{{{{phen,{methion,cystein}},argin},{aspart,{glutamin,{histid,tryptoph}}}},{leuc,{,glyc}}}},r=Join@@IntegerDigits[BinaryReadList@"b",2,8]},r!={},If[AtomQ@a,p=Join[p,{a,yl}/.{,yl}->{iso}];a=t,a=a[[1+#&@@r]];r=Rest@r]];##~Print~leucine&@@p

Incredibly, this is exactly one byte more than this bzip2 answer. Can somebody find a couple of bytes to save?!

This is a program that prints the full name of titin to STDOUT. We use the same structure described in Jörg Hülsermann's PHP answer, and additionally note that every word except "iso" ends in "yl", so we append "yl" after each decoded word unless that word is "iso", which the tree stores as an empty leaf (this is what p=Join[p,{a,yl}/.{,yl}->{iso}] does). We store the data in a 14971-byte file called "b" in the working directory (hex dump is here), which Join@@IntegerDigits[BinaryReadList@"b",2,8] converts to the list of its bits.
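For readers who don't speak Mathematica, the byte-to-bit conversion amounts to writing each byte as 8 binary digits, most significant first, and concatenating the results. A minimal JavaScript sketch (bytesToBits is a hypothetical helper name, not part of the answer):

```javascript
// Turn a sequence of bytes into a flat list of bits, 8 per byte, MSB first,
// mirroring Join@@IntegerDigits[bytes, 2, 8].
function bytesToBits(bytes) {
  const bits = [];
  for (const byte of bytes) {
    for (let k = 7; k >= 0; k--) {
      bits.push((byte >> k) & 1);
    }
  }
  return bits;
}

console.log(bytesToBits([5, 128]).join(''));
// → 0000010110000000
```

In Node.js the bytes could come from fs.readFileSync, which returns a Buffer that can be iterated byte by byte.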

That list of bits has been determined by Huffman coding, which is a lovely data compression scheme that takes relative frequencies into account; it requires the decoding template, stored here as the binary tree t, as well as the raw list of bits. Inexplicably, Mathematica doesn't have a Huffman decoding (or encoding) built-in, so that's what the For-loop implements.
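The decoding loop can be sketched in JavaScript as a walk down a nested-array tree: each bit selects a child, and reaching a leaf emits a word and restarts at the root. The tree below is a small hypothetical example, not the tree t from the answer, and it spells out 'iso' as a string where the answer uses an empty leaf. The Mathematica loop also emits a leaf one iteration later than this sketch does.

```javascript
// Toy tree: an inner node is [child for bit 0, child for bit 1]; a leaf is a string.
const tree = [['val', 'ser'], ['leuc', ['iso', 'glyc']]];

// Walk the tree bit by bit, emitting a symbol at each leaf.
function huffmanDecode(bits, root) {
  const symbols = [];
  let node = root;
  for (const bit of bits) {
    node = node[bit];
    if (typeof node === 'string') {
      symbols.push(node);
      node = root; // start over for the next symbol
    }
  }
  return symbols;
}

// Append "yl" to every word except the "iso" prefix, then concatenate.
function joinWords(symbols) {
  return symbols.map(w => (w === 'iso' ? w : w + 'yl')).join('');
}

// val = 00, iso = 110, leuc = 10, ser = 01
const bits = [0, 0, 1, 1, 0, 1, 0, 0, 1];
console.log(joinWords(huffmanDecode(bits, tree)));
// → valylisoleucylseryl
```

Frequent words sit near the root and so cost fewer bits, which is where the saving comes from.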