What's a "safe" URL shortening algorithm?

Entropy is your friend. Using only alphanumeric characters (special characters are best avoided in this case because they often need URL encoding, which complicates things) you have a "language" of 62 possible characters to choose from. For a string of length X made from this "language", the total number of possible strings is simply:

62**X

If you start blocking an IP address after Y failed attempts then the odds that an attacker with a single IP address will guess a code are:

Y/(62**X)

But an attacker can easily switch IP addresses, so let's imagine they have a million IP addresses at their disposal (note: the number will be much larger if you support IPv6). Their odds of success are then:

(1e6*Y)/(62**X)

Finally note (h/t @Falco) that the above assumes the attacker is looking for a particular code. If you are worried about someone finding any code then you need to further multiply by the number of active codes you have at a given time, which depends on how often they are created and how quickly they expire.

Given all of this though, you just have to decide how low you want the probability to be, plug in your Y, and solve for X. As a simple starting point I usually suggest a 32 character alphanumeric string (make sure to use a proper CSPRNG). If you block an IP after 1000 failed attempts then an attacker's odds of finding a specific code are:

(1e6*1000)/(62**32)

Which is 4.400134339715791e-49. Given those odds, it's more likely that the attacker will win the lottery 4 or 5 times in a row before they guess a code. You could have billions of active codes at a time and the odds of guessing any one would still effectively be zero.
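For anyone who wants to play with the numbers, here is a minimal Python sketch of the formulas above. The function names and the `active_codes` parameter are my own (the latter is the "multiply by the number of active codes" adjustment), and the result is the same simple estimate used above, not an exact probability.

```python
from fractions import Fraction

ALPHABET_SIZE = 62  # a-z, A-Z, 0-9


def guess_probability(length, attempts_per_ip, ip_count=1, active_codes=1):
    """Estimated odds of a successful guess: (IPs * Y * active codes) / 62**X."""
    total_guesses = ip_count * attempts_per_ip * active_codes
    return Fraction(total_guesses, ALPHABET_SIZE ** length)


def min_length(target_probability, attempts_per_ip, ip_count=1, active_codes=1):
    """Smallest code length X whose estimated odds are at most the target."""
    length = 1
    while guess_probability(length, attempts_per_ip, ip_count, active_codes) > target_probability:
        length += 1
    return length


# The example from the text: 32 characters, 1000 attempts per IP, a million IPs.
odds = guess_probability(32, 1000, ip_count=10**6)
print(float(odds))  # roughly 4.4e-49
```

Going the other way, `min_length(Fraction(1, 10**9), 1000, ip_count=10**6)` solves for X given a one-in-a-billion target under the same assumptions.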


TL;DR: Don't bother with rate limiting. Just generate a secure random 128-bit (or 192-bit) token for each URL using your preferred crypto API / library and base64url encode it. Include the encoded token in the URL and also store it in a secure database with the associated user, form and expiration data.


Like Conor Mancone, I would also suggest just including a single random token with sufficient entropy in the URL. You should obviously use a cryptographically secure random number source to generate these tokens.

When generating the URL, you should store each token in a database along with any associated information needed to authenticate the user and display the correct form. You may also want to store a creation and/or expiration timestamp, both to limit the validity period of the URLs (and thus reduce the risk of old e-mails being compromised) and also simply to allow you to purge old records from the database.
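A rough sketch of that storage step, using SQLite from the Python standard library purely for illustration. The table name, column names and helper functions are all hypothetical; a real deployment would use whatever database and schema it already has.

```python
import secrets
import sqlite3
import time


def open_store(path=":memory:"):
    """Open (or create) the hypothetical token table."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS url_tokens ("
        " token TEXT PRIMARY KEY,"
        " user_id TEXT NOT NULL,"
        " form_id TEXT NOT NULL,"
        " expires_at REAL NOT NULL)"
    )
    return db


def issue_token(db, user_id, form_id, ttl_seconds, now=None):
    """Store a fresh random token with its user, form and expiration time."""
    now = time.time() if now is None else now
    while True:
        token = secrets.token_urlsafe(16)  # 128 random bits from the OS CSPRNG
        try:
            with db:  # commits on success, rolls back on error
                db.execute(
                    "INSERT INTO url_tokens VALUES (?, ?, ?, ?)",
                    (token, user_id, form_id, now + ttl_seconds),
                )
            return token
        except sqlite3.IntegrityError:
            continue  # primary-key collision: just generate another token


def look_up(db, token, now=None):
    """Return (user_id, form_id) for a valid token, or None if unknown or expired."""
    now = time.time() if now is None else now
    return db.execute(
        "SELECT user_id, form_id FROM url_tokens WHERE token = ? AND expires_at > ?",
        (token, now),
    ).fetchone()


def purge_expired(db, now=None):
    """Delete expired records and return how many were removed."""
    now = time.time() if now is None else now
    with db:
        return db.execute("DELETE FROM url_tokens WHERE expires_at <= ?", (now,)).rowcount
```

The `now` parameter is only there to make the expiration logic easy to test; in normal use you would leave it out.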

As for what counts as "sufficient entropy", the precise lower limit obviously varies depending on your use case and threat model. In particular, assuming that you expect to have at most 2^p valid URLs in your database at any given time, that your adversary can make at most 2^q queries to your service and that they should have at most a one-in-2^r chance of successfully guessing a valid URL, your tokens should be at least p + q + r bits long.

In practice, a pretty safe "industry standard" token length would be 128 bits. Assuming that you'll have at most 2^32 valid URLs at a time, a 128-bit token would require an attacker to make at least 2^64 queries to your service to have a 1/2^32 chance of guessing even a single valid URL. For most purposes, this should be more than enough even without any kind of rate limiting.
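To make the p + q + r rule concrete, here is a small Python sketch. The function and parameter names are made up for illustration; each input is rounded up to the next power of two, so the result errs on the long side.

```python
def ceil_log2(n):
    """Smallest k such that 2**k >= n, for an integer n >= 1."""
    return (n - 1).bit_length()


def required_token_bits(max_valid_urls, max_queries, one_in_n_chance):
    """Minimum token length in bits: p + q + r.

    max_valid_urls  -> p (at most 2**p valid URLs at any time)
    max_queries     -> q (attacker makes at most 2**q queries)
    one_in_n_chance -> r (attacker succeeds with probability at most 1 / 2**r)
    """
    p = ceil_log2(max_valid_urls)
    q = ceil_log2(max_queries)
    r = ceil_log2(one_in_n_chance)
    return p + q + r


# The 128-bit example: 2**32 URLs, 2**64 queries, one-in-2**32 chance.
print(required_token_bits(2**32, 2**64, 2**32))  # 128
```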

(Tangentially, a 128-bit token length also allows you to generate up to about 2^64 random tokens before you can expect your first token collision. But that's largely irrelevant, since a database lets you detect collisions anyway and handle them just by generating a new token.)

If you really wanted to be sure, you could go up to 192 or even 256 bits. A 192-bit token, for example, would allow you to have up to 2^64 URLs while requiring at least 2^64 queries for an attack success probability of 1/2^64. And a 256-bit token would increase the difficulty of the attack by an extra factor of 2^64 on top of that, not that I see how that could possibly be necessary for any realistic threat.

As for generating and encoding the tokens, I would suggest simply generating a random 128-bit (or 192-bit or 256-bit) bitstring using any cryptographic RNG of your choice and encoding it using URL-safe Base64. (Most programming language runtimes should have a suitable RNG built in, or at least easy to install as a library. And if not, your OS most likely provides one, e.g. as /dev/urandom on Unixish systems.) This will produce a 22-character string for a 128-bit token, a 32-character string for a 192-bit token or a 43-character string for a 256-bit token. And it's quite a bit simpler than generating the token one character at a time as Conor Mancone's answer suggests.
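In Python, for instance, this might look like the sketch below. `new_token` is just an illustrative name; the `secrets` module draws from the operating system's CSPRNG.

```python
import base64
import secrets


def new_token(bits=128):
    """Return a random token of the given bit length, URL-safe Base64 encoded."""
    raw = secrets.token_bytes(bits // 8)
    # Base64 padding ("=") carries no information and is stripped here.
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


for bits in (128, 192, 256):
    print(bits, len(new_token(bits)))  # lengths 22, 32 and 43
```

Python also ships `secrets.token_urlsafe(16)`, which does the same thing in a single call.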


BTW, if you don't happen to have access to a convenient database and/or a secure RNG, another option would be to include all the necessary information (at least user ID, form ID and timestamp) in the URL itself together with a 128-bit cryptographic message authentication code of those values (computed and verified using a secret key stored on the server). Indeed, that's basically what JWT does to authenticate the tokens, just with a bit more overhead.

Note that, in this particular case, each token is only valid for a single user/form/timestamp combination, which the attacker must choose before attempting to guess the token, so effectively p = 0 (since 2^0 = 1). Thus, a somewhat shorter token can provide the same effective level of security as the random token method described earlier. Of course, this length saving is usually more than balanced by the extra parameters that need to be included in the URL.


If you want safety, I would recommend a UUIDv4 encoded in Base58. In essence you get 22 alphanumeric characters which are URL-safe and which store a full UUIDv4, which is (reasonably) guaranteed to be random and unguessable.
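A rough Python sketch of the idea, assuming the Bitcoin-style Base58 alphabet (no `0`, `O`, `I` or `l`); the helper names are made up, and the write-up linked below may use a different scheme. Note that a UUIDv4 carries 122 random bits, since 6 of its 128 bits are fixed version and variant markers, and that its unguessability depends on the generator using a secure random source.

```python
import uuid

# Bitcoin-style Base58 alphabet (an assumption; other orderings exist).
BASE58_ALPHABET = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"


def uuid_to_base58(u):
    """Encode a UUID's 128-bit value as a fixed-width 22-character string."""
    n = u.int
    chars = []
    while n:
        n, rem = divmod(n, 58)
        chars.append(BASE58_ALPHABET[rem])
    # Left-pad with the zero digit so every UUID encodes to exactly 22 characters.
    return "".join(reversed(chars)).rjust(22, BASE58_ALPHABET[0])


def base58_to_uuid(s):
    """Decode a string produced by uuid_to_base58 back into a UUID."""
    n = 0
    for ch in s:
        n = n * 58 + BASE58_ALPHABET.index(ch)
    return uuid.UUID(int=n)


code = uuid_to_base58(uuid.uuid4())
print(code, len(code))
```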

A nice write up on the subject: https://www.skitoy.com/p/base58-unique-ids/638/
