How do one-way hash functions work? (Edited)

Since everyone until now has simply defined what a hash function was, I will bite.

A one-way function is not just a hash function -- a function that loses information -- but a function f for which, given an image y ("SE" or 294 in existing answers), it is difficult to find a pre-image x such that f(x)=y.

This is why they are called one-way: you can compute an image but you can't find a pre-image for a given image.

None of the ordinary hash function proposed until now in existing answers have this property. None of them are one-way cryptographic hash functions. For instance, given "SE", you can easily pick up the input "SXXXE", an input with the property that X-encode("SXXXE")=SE.

There are no "simple" one-way functions. They have to mix their inputs so well that not only you don't recognize the input at all in the output, but you don't recognize another input either.

SHA-1 and MD5 used to be popular one-way functions but they are both nearly broken (specialist know how to create pre-images for given images, or are nearly able to do so). There is a contest underway to choose a new standard one, which will be named SHA-3.

An obvious approach to invert a one-way function would be to compute many images and keep them in a table associating to each image the pre-image that produced it. To make this impossible in practice, all one-way function have a large output, at least 64 bits but possibly much larger (up to, say, 512 bits).

EDIT: How do most cryptographic hash functions work?

Usually they have at their core a single function that does complicated transformations on a block of bits (a block cipher). The function should be nearly bijective (it shouldn't map too many sequences to the same image, because that would cause weaknesses later) but it doesn't have to be exactly bijective. And this function is iterated a fixed number of times, enough to make the input (or any possible input) impossible to recognize.

Take the example of Skein, one of the strong candidates for the SHA-3 context. Its core function is iterated 72 times. The only number of iterations for which the creators of the function know how to sometimes relate the outputs to some inputs is 25. They say it has a "safety factor" of 2.9.

Think of a really basic hash - for the input string, return the sum of the ASCII values of each character.

hash( 'abc' ) = ascii('a')+ascii('b')+ascii('c')
              = 97 + 98 + 99
              = 294

Now, given the hash value of 294, can you tell what the original string was? Obviously not, because 'abc' and 'cba' (and countless others) give the same hash value.

Cryptographic hash functions work the same way, except that obviously the algorithm is much more complex. There are always going to be collisions, but if you know string s hashes to h, then it should be very difficult ("computationally infeasible") to construct another string that also hashes to h.

Shooting for a simple analogy here instead of a complex explanation.

To start with, let's break the subject down into two parts, one-way operations and hashing. What is a one-way operation and why would you want one?

One way operations are called that because they are not reversible. Most typical operations like addition and multiplication can be reversed while modulo division can not be reversed. Why is that important? Because you want to provide a output value which 1) is difficult to duplicate without the original inputs and 2) provides no way to figure out the inputs from the output.



4 + 3 = 7  

This can be reversed by taking the sum and subtracting one of the addends

7 - 3 = 4  


4 * 5 = 20  

This can be reversed by taking the product and dividing by one of the factors

20 / 4 = 5

Not Reversible

Modulo division:

22 % 7 = 1  

This can not be reversed because there is no operation that you can do to the quotient and the dividend to reconstitute the divisor (or vice versa).

Can you find an operation to fill in where the '?' is?

1  ?  7 = 22  
1  ?  22 = 7

With that being said, one-way hash functions have the same mathematical quality as modulo division.

Why is this important?

Lets say I gave you a key to a locker in a bus terminal that has one thousand lockers and asked you to deliver it to my banker. Being the smart guy you are, not to mention suspicious, you would immediately look on the key to see what locker number is written on the key. Knowing this, I've done a few devious things; first I found two numbers that when divided using modulo division gives me a number in the range between 1 and 1000, second I erased the original number and written on it the divisor from the pair of numbers, second I chose a bus terminal that has a guard protecting the lockers from miscreants by only letting people try one locker a day with their key, third the banker already knows the dividend so when he gets the key he can do the math and figure out the remainder and know which locker to open.

If I choose the operands wisely I can get near to a one-to-one relationship between the quotient and the dividend which forces you to try each locker because the answer spreads the results of the possible inputs over the range of desired numbers, the lockers available in the terminal. Basically, it means you can't acquire any knowledge about the remainder even if you know one of the operands.

So, now I can 'trust' you to deliver the key to its rightful owner without worrying that you can easily guess to which locker it belongs. Sure, you could brute force search all the lockers but that would take almost 3 years, plenty of time for my banker to use the key and empty the locker.

See the other answers for more specifics on the different hash functions.