Can a file contain its md5sum inside it?

Consider this: you create a file that contains every member of the set of 16-byte sequences. An MD5 checksum is a 16-byte sequence, so by definition this file contains its own MD5 checksum. Somewhere.


Theoretically? Yes.

Practically, however, since /any/ change to a file's contents, no matter how minute, causes a drastic change in the checksum (which is how md5 checksums work, after all), you'd need to be able to predict how the checksum will change when you alter the file to include the checksum -- for all intents and purposes this isn't much different from being able to break the md5 hashing algorithm.

There's no such thing as "impossible" in cryptography, but the science does acknowledge the concept of "practically undoable" or "statistically improbable" and that's pretty much what you're dealing with here, at the moment.


Update: thinking about it again, I found a method that should allow the construction of a file containing its own MD5 much faster than what I was explaining initially. The new cost should be about 2^65 elementary invocations of MD5, i.e. a lot less than the 2^119 I was talking about; it would even be technologically feasible (with a budget counted in millions of dollars -- but not billions). See the end of this answer for a description of the new method.


Original answer:

Let's assume that MD5 is a "perfect" hash function which can be modeled as a random oracle. A random oracle is a function for which you know nothing of the output for a given input before trying it once. For a random oracle, the best method to achieve what you are looking for is hope: you try random input messages until you find one which contains its own hash. The question is then: what size of input messages should you use?

MD5 processes data by adding some bits of padding (at least 65, at most 576) so that the length is a multiple of 512; then data is split into 512-bit blocks. The cost of hashing a message is directly proportional to the number of such blocks: for an n-bit message, the cost is ceil((n+65)/512) block evaluations. An n-bit message, on the other hand, offers n-127 subsequences of 128 bits. Longer messages make it more probable to succeed at each message (in a linear way) but cost more to process (linearly too). So message length is mostly neutral, except that the overhead implied by the padding is larger when using short messages. Overall, with large enough random messages (e.g. 8 kB), you will find a message which contains its own MD5 at an average cost of about 2^119 elementary MD5 evaluations. An elementary evaluation of MD5 uses a few hundred clock cycles on a recent CPU, and 2^119 is totally unachievable with today's technology (and tomorrow's technology, too).
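To make the cost estimate concrete, here is a small Python sketch of the arithmetic above, under the same random-oracle assumption. The function names are mine and purely illustrative; the per-message success probability is approximated as (n-127) * 2^-128.

```python
import math

BLOCK_BITS = 512        # MD5 block size
DIGEST_BITS = 128       # MD5 output size
MIN_PADDING_BITS = 65   # one '1' bit plus the 64-bit length field

def md5_blocks(n_bits: int) -> int:
    """Number of 512-bit blocks processed for an n-bit message: ceil((n+65)/512)."""
    return (n_bits + MIN_PADDING_BITS + BLOCK_BITS - 1) // BLOCK_BITS

def log2_expected_cost(n_bits: int) -> float:
    """log2 of the expected number of block evaluations before a random
    n-bit message contains its own digest (random-oracle model)."""
    windows = n_bits - (DIGEST_BITS - 1)              # n - 127 candidate positions
    log2_messages = DIGEST_BITS - math.log2(windows)  # expected messages to try
    return log2_messages + math.log2(md5_blocks(n_bits))

for n_bytes in (16, 64, 1024, 8192):
    print(n_bytes, "bytes ->", round(log2_expected_cost(8 * n_bytes), 2))
```

Short messages pay for the padding block (a 16-byte message costs the full 2^128), while the 8 kB case lands near 2^119, as claimed.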

(The "big file with all 128-bit sequences" that Graham Lee is talking about is just a special case of this generic method, with a single very large message.)

Now MD5 is widely known to not be a random oracle -- if only because collisions on MD5 can be computed efficiently, something which is not possible with a random oracle. So it is conceivable that shortcuts exploiting weaknesses in MD5 structure exist. However, I am not aware of any attack leading to a message containing its own MD5; this looks like a problem close to preimage resistance, something which is viewed as substantially more difficult than collisions.


New method:

MD5, like most (if not all) hash functions, is streamed: when it processes a long input, it does so in one pass, keeping a small fixed-size running state. For MD5 specifically, the running state has size 128 bits (16 bytes), and data is processed in chunks of 512 bits (64 bytes). An important consequence is the following: if you have inputs m and m||x ("||" denotes concatenation), and you want to compute both MD5(m) and MD5(m||x), then the extra cost needed to compute the second one is proportional to the size of x, but NOT to the size of m. In other words, if you have a 1 gigabyte input m, compute MD5(m), and then want to compute the MD5 of m followed by a 20-byte trailer x, then that second MD5 can reuse much of the work done for the first one, and will be almost free.
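This reuse is easy to see with Python's hashlib, whose hash objects expose copy() to clone the running state. The sketch below uses a 1 MB stand-in for m and an arbitrary 20-byte trailer; the helper name md5_of_extension is mine.

```python
import hashlib

def md5_of_extension(prefix_state, x: bytes) -> bytes:
    """Return MD5(m||x) given a hashlib object that has already absorbed m.
    The prefix state is copied, so it stays reusable for other trailers."""
    extended = prefix_state.copy()   # clones the small running state
    extended.update(x)               # cost depends on len(x), not len(m)
    return extended.digest()

m = b"\x00" * 1_000_000              # stand-in for the large input m
prefix_state = hashlib.md5(m)        # the expensive pass over m, done once
digest_m = prefix_state.digest()     # MD5(m); the object remains usable
digest_mx = md5_of_extension(prefix_state, b"20-byte trailer here")

print(digest_m.hex())
print(digest_mx.hex())
```

Both digests match what you would get by hashing m and m||x from scratch, but the second one never touches m again.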

This leads to the following algorithm for finding a message m that contains its own MD5:

  1. Start with some value m.
  2. Compute MD5(m). If it is part of m, then exit (we found our message).
  3. Replace m with m||x where x is such that the sequence of the last 128 bits of m||x did not appear anywhere in m.
  4. Loop to 2.
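
The loop can be tried at toy scale by truncating the digest to k bytes (k = 2 here, instead of 16), so that a hit is expected after a few hundred iterations. This is only a sketch under several assumptions of mine: it is byte-oriented, it keeps every window in an in-memory set, it picks x pseudo-randomly among the bytes that yield an unseen window, and if no such byte exists it appends a byte anyway.

```python
import hashlib
import random

def find_self_hashing_message(start: bytes, k: int = 2, seed: int = 0) -> bytes:
    """Toy version of steps 1-4 with the MD5 digest truncated to k bytes."""
    rng = random.Random(seed)
    m = bytearray(start)                       # step 1
    state = hashlib.md5(start)                 # running MD5 state for m
    seen = {bytes(m[i:i + k]) for i in range(len(m) - k + 1)}
    while True:
        if state.digest()[:k] in seen:         # step 2: truncated MD5(m) is in m
            return bytes(m)
        # Step 3: choose one byte x so that the last k bytes of m||x are new.
        tail = bytes(m[-(k - 1):]) if k > 1 else b""
        offset = rng.randrange(256)
        candidates = [(offset + j) % 256 for j in range(256)]
        x = next((c for c in candidates if tail + bytes([c]) not in seen),
                 candidates[0])                # fallback: append a byte anyway
        m.append(x)
        state.update(bytes([x]))               # incremental hashing of m||x
        window = tail + bytes([x])
        if len(window) == k:                   # ignore windows shorter than k
            seen.add(window)
        # step 4: loop

msg = find_self_hashing_message(b"hello")
tag = hashlib.md5(msg).digest()[:2]
print(len(msg), tag.hex(), tag in msg)
```

The starting value survives as a prefix of the result. The set of seen windows is exactly the lookup structure that becomes the bottleneck at full scale.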

Finding the right "x" value at each step can be done by using a De Bruijn sequence. Use B(2, 128) as the base sequence if each x is a single bit. If you want a byte-oriented solution (the message m must consist of an integral number of bytes, and MD5(m) must appear within m at a byte boundary), then use B(256, 16).
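
For illustration, here is the well-known recursive construction of a De Bruijn sequence B(k, n), shown on small parameters (B(256, 16) itself is far too long to materialise). Feeding its symbols one at a time as x never repeats an n-symbol window until the sequence wraps around. The helper cyclic_windows_distinct is a name of mine, used only to check that property.

```python
def de_bruijn(k: int, n: int) -> list:
    """Return a De Bruijn sequence B(k, n) over the alphabet 0..k-1."""
    a = [0] * (k * n)
    sequence = []

    def db(t: int, p: int) -> None:
        if t > n:
            if n % p == 0:
                sequence.extend(a[1:p + 1])
        else:
            a[t] = a[t - p]
            db(t + 1, p)
            for j in range(a[t - p] + 1, k):
                a[t] = j
                db(t + 1, t)

    db(1, 1)
    return sequence

def cyclic_windows_distinct(seq: list, n: int) -> bool:
    """True if every length-n window of seq, read cyclically, is distinct."""
    ext = seq + seq[:n - 1]
    windows = {tuple(ext[i:i + n]) for i in range(len(seq))}
    return len(windows) == len(seq)

bits = de_bruijn(2, 4)
print("".join(map(str, bits)), cyclic_windows_distinct(bits, 4))
```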

To compute the average number of iterations needed to find a hit, consider that at iteration n, the message m contains n distinct subsequences of 128 bits (or 16 bytes), so the total accumulated number of comparisons will be n(n+1)/2. Assuming MD5 to be a random oracle, then each comparison has probability 2^-128 of being a hit, so n will have, on average, to be such that n(n+1)/2 = 2^128 -- which translates to n = 2^64.5 iterations.
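
The arithmetic can be checked with exact integers. The sketch below (function name mine) finds the smallest n with n(n+1)/2 >= 2^b and reports log2(n), both for the real 128-bit digest and for a 16-bit toy digest.

```python
import math

def log2_expected_iterations(digest_bits: int = 128) -> float:
    """log2 of the smallest n such that n(n+1)/2 >= 2^digest_bits."""
    target = 1 << digest_bits
    n = math.isqrt(2 * target)           # largest n with n^2 <= 2 * target
    while n * (n + 1) // 2 < target:     # nudge up until the inequality holds
        n += 1
    return math.log2(n)

print(log2_expected_iterations(128))     # full MD5
print(log2_expected_iterations(16))      # 2-byte toy digest
```

For 128 bits this gives essentially 64.5; for a 16-bit digest it gives about 8.5, i.e. a few hundred iterations.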

However, each iteration involves computing MD5(m||x) where x is very small (one bit or one byte) and MD5(m) has already been computed; this will usually require only one extra elementary MD5 computation (processing of a single 64-byte block). (If x are bits then only one iteration in 512 will require processing two blocks; if x are bytes then this becomes one iteration in 64.)

Either way, the hard part will be the lookup. Getting all subsequences in an index suitably sorted for fast lookup will require an awful lot of fast RAM, which would probably be way more expensive than computing the 2^64.5 MD5 hashes. However, some De Bruijn sequences allow for a fast, storage-free decoding. Therefore, with this algorithm, we can find a message m that contains its own MD5 for a cost close to 2^65 computations of MD5. The resulting message will have length about 3.3*10^18 bytes, i.e. about one million modern hard disks (eight times as much if we want a byte-oriented solution).
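
As a quick check on the size figure: 2^64.5 iterations, each appending one bit (or one byte in the byte-oriented variant), give the following message lengths. The function and its parameter names are illustrative only.

```python
def message_size_bytes(log2_iterations: float = 64.5, bits_per_step: int = 1) -> float:
    """Approximate final message size in bytes after 2^log2_iterations steps,
    each appending bits_per_step bits."""
    return (2.0 ** log2_iterations) * bits_per_step / 8

bit_oriented = message_size_bytes(bits_per_step=1)
byte_oriented = message_size_bytes(bits_per_step=8)
print(f"bit-oriented:  {bit_oriented:.2e} bytes")
print(f"byte-oriented: {byte_oriented:.2e} bytes")
```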

It may be noted that the algorithm can be started with an arbitrary message m, of any size. That starting point will appear at the start of the self-MD5 file that the algorithm produces.

(In my original answer, the mistake was in this sentence: "Longer messages make it more probable to succeed at each message (in a linear way) but cost more to process (linearly too)." As explained above, longer messages can still be processed very efficiently as long as we generate them by reusing a common prefix, as in my new algorithm.)