# Attack against OTP Cipher

### How does a One-Time Pad work?

Imagine you have a message *M* which is encrypted with a key *K*, which then results in a ciphertext *C*. Let us assume that the process through which the encryption occurs is an XOR, which I'll show by the ^ symbol. Furthermore, we assume *M*, *K* and *C* all have the same length. So we know:

```
M ^ K = C
```

Assuming an attacker has access to *C* and wants to recover *M*, this is cryptographically impossible. Why? Because even if an attacker tried every single possible key *K*, they would receive every single possible message *M'*. It is impossible for them to tell which is the correct message. In fact, they could simply look at every single possible message of that length and they would not be any wiser.

However...

### Why is it called One-Time Pad?

Because the same key can only be used once. If you use it twice, certain issues arise.

Let's take the same scenario as above, but now we have two messages *M1* and *M2*, one key *K* and two cipher *C1* and *C2*.

```
M1 ^ K = C1
M2 ^ K = C2
```

The curious thing about XOR is that it is a "reversible" operation. That means XOR'ing something with the same value twice results in the original value:

```
X ^ X = 0
X ^ 0 = X
therefore
X ^ Y ^ X = Y
```

Order of operations does not matter, just like in addition. Let's assume that the attacker has *C1* and *C2*, but not access to *M1* and *M2*.

```
C1 ^ C2 = (M1 ^ K) ^ (M2 ^ K)
```

Since order of operations does not matter, we can remove the parenthesis and group the *K* together:

```
C1 ^ C2 = (M1 ^ M2) ^ (K ^ K)
```

We learned above that a value XOR'd by itself is 0, and that a value XOR'd by 0 is itself. As such, we can simply remove the right parenthesis and get:

```
C1 ^ C2 = M1 ^ M2
```

This means that you now have access to two plaintext messages XOR'd to each other. This is a lot more information than just the ciphertexts.

### How can I go from there?

Let us assume that a message is only uppercase ASCII and spaces, and you know the following:

```
C1 = 55 3a 90 26 b3 b6 48 37 6f c1 45 f7 e8 47 61 78 21 52
C2 = 42 33 97 55 ca b0 4e 37 61 ca 24 f9 e6 34 66 71 2f 44
C1 ^ C2 = 17 09 07 73 79 06 06 00 0e 0b 61 0e 0e 73 07 09 0e 16
```

At first glance, this may not seem to tell us a lot. After all, it's just some hexadecimal, right?

Well, if you look at the values in the last line, you see some low values and some high values. Let's look at the high values, the first one being `0x73`

, which is the ASCII value of the lowercase `s`

. Why is this interesting? We know that it's the result of the XOR of the message *M1* with *M2*, and both can only be uppercase and spaces. If you look closely at the ASCII value of a space, you see it's `0x20`

or `0010 0000`

in binary. Meaning that XOR'ing with a space only flips one bit. If you look at the ASCII table, you will notice that uppercase and lowercase characters also only differ by one bit. This was done so that "to Upper" and "to Lower", as well as "toggle case" functions only had to operate on one bit.

So we know the following:

Either of the following is true:

- The fourth byte of
*M1*is`S`

- The fourth byte of
*M2*is`_`

- The fourth byte of
*K*is`0x75`

or

- The fourth byte of
*M2*is`S`

- The fourth byte of
*M1*is`_`

- The fourth byte of
*K*is`0x06`

(Note that I am using `_`

to represent a space for better visibility)

We cannot yet tell which one of these is true, but we know they are mutually exclusive. If you have more messages *C2*, *C3*, etc. all encrypted by the same key, then you can simply determine which one is the one with the space, as all others will return either valid lowercase ASCII symbols or `0x00`

.

In fact, let's assume you intercepted a third message *C3*, with the value `4826f938d2a63b596dcc45f8e8347778354e`

Now you know:

```
C1 = 55 3a 90 26 b3 b6 48 37 6f c1 45 f7 e8 47 61 78 21 52
C2 = 42 33 97 55 ca b0 4e 37 61 ca 24 f9 e6 34 66 71 2f 44
C3 = 48 26 f9 38 d2 a6 3b 59 6d cc 45 f8 e8 34 77 78 35 4e
C1 ^ C2 = 17 09 07 73 79 06 06 00 0e 0b 61 0e 0e 73 07 09 0e 16
C1 ^ C3 = 1d 1c 69 1e 61 10 73 6e 02 0d 00 0f 00 73 16 00 14 1c
C2 ^ C3 = 0a 15 6e 6d 18 16 75 6e 0c 06 61 01 0e 00 11 09 1a 0a
```

This already gives us a bit more information. A lot more, in fact. First of all, let's have a look at the above hypothesis: We know either *M1[4]* is `_`

or M2[4] is `_`

. Since *C1 ^ C3* (and thus *M1 ^ M3*) does not result in a lowercase ACII character, we know both *M1[4]* and *M3[4]* are not spaces. Therefore we know that the first of our two hypothesized cases is true, and we know a bit more about the key and the other messages:

```
C1 = 55 3a 90 26 b3 b6 48 37 6f c1 45 f7 e8 47 61 78 21 52
C2 = 42 33 97 55 ca b0 4e 37 61 ca 24 f9 e6 34 66 71 2f 44
C3 = 48 26 f9 38 d2 a6 3b 59 6d cc 45 f8 e8 34 77 78 35 4e
M1 = ?? ?? ?? S ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ??
M2 = ?? ?? ?? __ ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ??
M3 = ?? ?? ?? M ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ??
K = ?? ?? ?? 75 ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ??
C1 ^ C2 = 17 09 07 73 79 06 06 00 0e 0b 61 0e 0e 73 07 09 0e 16
C1 ^ C3 = 1d 1c 69 1e 61 10 73 6e 02 0d 00 0f 00 73 16 00 14 1c
C2 ^ C3 = 0a 15 6e 6d 18 16 75 6e 0c 06 61 01 0e 00 11 09 1a 0a
```

We can also see that *M1[3] ^ M3[3]* and *M2[3] ^ M3[3]* result in printable characters, so we know *M3[3]* must be `_`

and therefore *K[3] = C3[3] ^ 0x20*, which is `d9`

.

After doing all of this, your grid should look like this:

```
C1 = 55 3a 90 26 b3 b6 48 37 6f c1 45 f7 e8 47 61 78 21 52
C2 = 42 33 97 55 ca b0 4e 37 61 ca 24 f9 e6 34 66 71 2f 44
C3 = 48 26 f9 38 d2 a6 3b 59 6d cc 45 f8 e8 34 77 78 35 4e
M1 = ?? ?? I S __ ?? S __ ?? ?? __ ?? ?? S ?? ?? ?? ??
M2 = ?? ?? N __ Y ?? U __ ?? ?? A ?? ?? __ ?? ?? ?? ??
M3 = ?? ?? __ M A ?? __ N ?? ?? __ ?? ?? __ ?? ?? ?? ??
K = ?? ?? d9 75 93 ?? 1b 17 ?? ?? 65 ?? ?? 14 ?? ?? ?? ??
C1 ^ C2 = 17 09 07 73 79 06 06 00 0e 0b 61 0e 0e 73 07 09 0e 16
C1 ^ C3 = 1d 1c 69 1e 61 10 73 6e 02 0d 00 0f 00 73 16 00 14 1c
C2 ^ C3 = 0a 15 6e 6d 18 16 75 6e 0c 06 61 01 0e 00 11 09 1a 0a
```

### Are we stuck now?

This is as much as you can infer with 100% certainty. Now you can start to make some educated guesses. For example, you can see that *M2* there is a three-letter word starting with `Y`

and ending in `U`

. You can make an educated guess here and assume that means "you". And indeed, if we assume that, we'd get the following:

```
M1 = ?? ?? I S __ I S __ ?? ?? __ ?? ?? S ?? ?? ?? ??
M2 = ?? ?? N __ Y O U __ ?? ?? A ?? ?? __ ?? ?? ?? ??
M3 = ?? ?? __ M A Y __ N ?? ?? __ ?? ?? __ ?? ?? ?? ??
K = ?? ?? d9 75 93 ff 1b 17 ?? ?? 65 ?? ?? 14 ?? ?? ?? ??
```

The other characters seem to fit. *M1* forms the word "is" and *M3* forms the word "may". Since those are legitimate English words, you can be pretty certain that those are correct. Furthermore, note that some of the XOR'd messages result in `0x00`

, which means that these must be the same letter. For example, even though you don't know what *M1[13] ^ M2[13]* is, you know that both of these letters must be identical.

Keep going from there and see if you can crack the rest.

I will introduce a modern attack on (IV, key) pair re-use in Stream Ciphers and key reuse on OTP.

- A Natural Language Approach to Automated Cryptanalysis of Two-time Pads by Mason at al. in 2006

Bolds are mine, and the abstract;

While keystream reuse in stream ciphers and one-time pads has been a well known problem for several decades, the risk to real systems has been underappreciated.

Previous techniques have relied on being able to accurately guess words and phrases that appear in one of the plaintext messages, making it far easier to claim that “an attacker would never be able to do that.” In this paper,we show how an adversary can automatically recover messages encrypted under the same keystream if only the type of each message is known(e.g. an HTML page in English). Our method, which is related to HMMs, recovers the most probable plaintext of this type by using a statistical language model and a dynamic programming algorithm.It produces up to 99% accuracy on realistic data and can process ciphertexts at 200ms per byte on a $2,000 PC.To further demonstrate the practical effectiveness of the method, we show that our tool can recover documents encrypted by Microsoft Word 2002

Now, the two-time pad (or many-time pad) attack is no more a hand job.

- Never ever re-use a key in OTP and similarly
- never ever use (key,IV) pair in CTR, OFB, CFB, ChaCha series, GCM...