How to safely store sensitive data like a social security number?

What you're looking for is deterministic encryption: that the same value encrypted twice gives the same output. Given deterministic encryption with a key K, an attacker would need the key to determine which SSN maps to which encrypted value. You can still perform searches on the deterministically encrypted data, but only equivalency comparisons (==, !=).

Examples of deterministic crypto that would work:

  • Block ciphers in ECB mode, if the data is <1 block long
  • Block ciphers in CBC mode, with a static IV.
  • Block ciphers in CBC mode with an IV derived from the plaintext. (Note that you don't want to store the IV then, so decryption without the plaintext is thus impossible, so this is a search-only option.)

What won't work:

  • CTR Mode with a static IV (an attacker can then use multiple ciphertexts to recover the keystream & plaintexts)
  • CBC Mode with a random IV (can't search)
  • Any stream cipher (same as CTR mode)

Note that, in all cases, you are giving up ciphertext indistinguishability, but that's a core requirement of being able to search on the ciphertexts.

You do need a mechanism to share the key with other systems that need access to the plaintext, but an attacker who gains access to a database backup, SQL injection, or any other attack that gives access only to the database won't be able to discern the plaintexts.

PKI is not useful here, as you point out, as having the public key allows to enumerate the values and recover them, if you're using a deterministic PKI cryptosystem (plain, unpadded, RSA, for example). Using a non-deterministic PKI (padded RSA) will not allow you to search on the ciphertexts.

I would review whether you really need to encrypt small, easily brute forced plaintexts. What is your threat model? Can you protect against these threats in other ways?


Keep in mind there are two separate pieces to securing this data, when it's at rest and when it's in transit.

You should not store (data at rest) any kind sensitive data directly in clear text, period. Things like passwords and social security, and credit card numbers should be encrypted before they are stored on disk. I agree with lorenzog about decoupling your solution but I suggest a slightly different setup:

  1. Database server. This server stores sensitive encrypted fields in a database (SQL/MySQL/Oracle), but never has the cleartext data. It will be encrypted before it's stored in the database table / field. It also does not have the private key to decrypt the data, just encrypted blobs.

  2. Crypto application server. This server stores the private key used for encrypting and decrypting the fields for an authenticated, authorized user. This is the only place the data stored in the database server can be encrypted and decrypted. Obviously this will be a high asset target, and should be hardened and controlled via policy. Treat similar to a domain controller for example and audit all access and queries to it.

  3. Web Server. Load balance requests from the user and secure communication between servers and services. Serve as endpoint for communication to external users.

Communication (data in transit) with the client and your partner teams is also very important here, don't over look that. Make sure you are using SSL and at the highest levels of ciphers and encryption possible.

It won't be easy to setup (harder than no basic security for sure, but not impossible by any means) and if you breach your customers trust you'll be in much worst shape than the time it takes to get securing personal data right. :)

Good luck!


Actually, you have THREE problems that you have implied in your question.

  • The title talks about data at rest.
  • In the question, you talk about access control as well.
  • In addition, you then also have a question of data in transit.

The question may have a different answer if you are already using a DB system and introducing encryption in an existing system. Many of the DB systems now support such security features (see below).

Access control and data in transit

Most DB systems support access control from the first day (it's almost a min requirement). However, when you say the such and such system needs to be able to read it, it's really an access control question.

Likewise, data in transit is also a question of the protocols used, many of which are supported by existing DB system(s). For example, SQL Server supports SSL for connections, as does MySQL. (Search for others, they might support it too.)

Encryption at rest

The third is encryption at rest, which solves the issue of if an unauthorized person or system were to get the actual DB file, what do they see. It also comes a related issue of key management, i.e. why can't whoever got your DB file not get the keys?

During the design, it would be prudent to assume that one day the key(s) could be compromised or stolen or, purely from a crypto agility point of view, you will have to change the algorithm and keys (e.g. whoever used DES had to eventually move to AES). Even though it can't be 0 cost, there has to be a path esp. if your DB is going to be a distributed one, to change either the algorithm or the key.

Many DBs now do provide encryption at rest along with some key management solutions. For example SQL Server has supported encryption since 2008. In addition, SQL server has published a key lifecycle management story too with apparently supports symmetric as well as asymmetric keys (via certificates). I believe SQL also supports full DB encryption vs selected fields via queries (such as in your case for SSN).

Likewise MySQL also supports encryption via query functions, which you could utilize for your SSN scenario. You can likewise other DB systems as well that might already support encryption and use those.

If you utilize a system that support built-in encryption, you are likely to avoid many pitfalls associated with doing it your own, as well as get a supported system.

Research DB

CryptDB is a DB system developed at MIT which encrypts data at rest and also supports running queries over encrypted data. If you look at the page for the system, it lists organizations that are actually using it.

Writing own encryption logic

This is probably more time consuming and more challenging to get it right, but based on your question, it seems that you are contemplating this as an issue. If I were in a similar situation, I would definitely avoid it and go with one of the existing DB systems.

There are many issues. For example, when you encrypt data, the output is somewhat randomized so encrypting the same data with the same key will usually not result in the same cipher text. It might be a bit challenging and you may have to decrease entropy (e.g. by using the same IVs or salts) which might impact the security of your system. And with something as simple like as storing hashes (or even HMACs with a single key), if someone gets the database file(s), they can run brute force to recover the data within weeks, if not days. This is especially true of fields like SSN, unless you were to spend time and always require multiple fields for a query (e.g. SSN and DOB and first three letters of last name, or such combinations), and only store those as hashed but neither of these separately. This will increase entropy and make it harder for someone to find actual values were they to get your DB file.

Other than that, one has to figure out key lifecycle management issues.

EDIT: It's actually a common issue and I had once evaluated encrypting data, when I wrote the initial response, I did not include that here. I have since updated my response to include that, as well as clarify access control, secure connection and data at rest issues.

Tags:

Encryption