Data obfuscation in SQL Server

I wish I could upvote you 100 points just for thinking about this! I have seen this subject overlooked more times than I can count, so well done. From what I understand, you want to scramble the data within the fields themselves. I understand what you are trying to achieve, but it might not be strictly necessary, although that should be considered on a case-by-case basis.

Most data protection laws revolve around the ability to correctly associate a piece of data with an individual - for example a date of birth or a phone number. You can meet the requirements of the law by ensuring that when you move your data out of production into UAT it is jumbled up so that it cannot easily be re-mapped to the original person - especially when you jumble forenames and surnames.
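To make the "jumbling" idea concrete, here is a minimal T-SQL sketch that shuffles forenames across rows so they no longer line up with their original surnames. The table variable @Customer and its columns are made up for illustration; substitute your own table. It is one possible approach under those assumptions, not the way any particular tool does it.

```sql
-- Hypothetical sample data; @Customer and its columns are assumptions for this sketch.
DECLARE @Customer TABLE (
    CustomerId int PRIMARY KEY,
    Forename   varchar(50),
    Surname    varchar(50)
);

INSERT INTO @Customer (CustomerId, Forename, Surname)
VALUES (1, 'Alice', 'Adams'),
       (2, 'Brian', 'Baker'),
       (3, 'Carol', 'Clark'),
       (4, 'David', 'Dixon');

-- Number the forenames in random order, number the rows in key order,
-- then hand each row the forename that landed on the same number.
WITH shuffled AS (
    SELECT Forename, ROW_NUMBER() OVER (ORDER BY NEWID()) AS rn
    FROM @Customer
),
ordered AS (
    SELECT CustomerId, ROW_NUMBER() OVER (ORDER BY CustomerId) AS rn
    FROM @Customer
)
UPDATE c
SET    c.Forename = s.Forename
FROM   @Customer AS c
JOIN   ordered  AS o ON o.CustomerId = c.CustomerId
JOIN   shuffled AS s ON s.rn = o.rn;

SELECT CustomerId, Forename, Surname
FROM   @Customer
ORDER  BY CustomerId;
```

Because the shuffle is random, a row can occasionally get its own forename back, so on real data you would probably shuffle surnames separately as well and spot-check the result.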

However, this does not address the issue of, say, contact details. Jumbling the data may meet the requirements of the law, but the phone numbers are still real, the emails are still real and so on; they are just not assigned to the correct person. For this I recommend, if at all possible, clearing that data before passing it into UAT. Red Gate make a piece of software called Data Generator that can create random test data for you, so that you can repopulate the fields with data that can be tested against.
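If you don't have a generator tool to hand, clearing and repopulating contact fields can be as simple as the sketch below. The @Contact table and its columns are hypothetical, and the placeholder scheme (key-based addresses on example.com, a domain reserved for documentation) is just one choice that guarantees no real person gets an email from your test system.

```sql
-- Hypothetical sample data; @Contact and its columns are assumptions for this sketch.
DECLARE @Contact TABLE (
    ContactId int PRIMARY KEY,
    Phone     varchar(20)  NULL,
    Email     varchar(100) NULL
);

INSERT INTO @Contact (ContactId, Phone, Email)
VALUES (1, '555-0101', 'someone@realdomain.test'),
       (2, '555-0102', 'another@realdomain.test');

-- Clear the phone number and replace the email with a value derived from the key,
-- so each row stays unique but nothing points at a real mailbox.
UPDATE @Contact
SET    Phone = NULL,
       Email = 'user' + CAST(ContactId AS varchar(10)) + '@example.com';

SELECT ContactId, Phone, Email
FROM   @Contact
ORDER  BY ContactId;
```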

As for data scrambling: there are many applications that do this for you, and honestly you are right not to want to reinvent the wheel. The one that we use at our company is a product called Data Masker by a company called Net2000. The license is pretty cheap, it works extremely fast and you don't have to worry about disabling all your constraints before scrambling the database.

You can of course roll your own solution should you not find anything that meets your requirements. If you do decide to do this, I would strongly recommend using CLR procedures, as they are much more flexible than pure T-SQL (which is not to say that you can't use T-SQL).

Once you have chosen an application to perform this for you, the next thing you need to decide is what you actually want or need to scramble. Honestly, your best resource for this is your company legal team and/or the company auditors. I know that sometimes we may not like working with them, but they will be much nicer to you for approaching them and asking the question than if you try to do it on your own and get it wrong. There is absolutely nothing wrong with asking for help - especially when it is as important as this.

I hope this helps you and I wish you good luck in your quest... ;-)


Mr. Brownstone hit the nail right on the head. Now to help you out a bit, here is my "garble" function, used to obfuscate strings (funny results with names!). Pass in a string and it returns a garbled string. Include it in update statements against string columns, and change the data length as you see fit.

---------------------
-- Garble Function --
---------------------
-- Make a function to slightly garble the strings
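IF (object_id('dbo.fn_Garble') is not null)
  drop function dbo.fn_Garble
go
create function dbo.fn_Garble
(
  @String varchar(255)
)
returns varchar(255)
as
BEGIN
  -- Nested replaces run innermost first: o->e, a->o, i->a, u->i, t->p, c->k, d->th, ee->e, oo->or, ll->ski
  -- 'd'->'th' and 'll'->'ski' lengthen the string, so anything past 255 characters is truncated
  select @String = replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(@String,'o','e'),'a','o'),'i','a'),'u','i'),'t','p'),'c','k'),'d','th'),'ee','e'),'oo','or'),'ll','ski')
  return @String
END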
go
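As a rough usage sketch, this is what "include it in update statements" might look like. The @Person table is made up for illustration, and the call assumes the function ended up in the dbo schema (scalar functions have to be called with at least a two-part name).

```sql
-- Hypothetical sample data; @Person and its columns are assumptions for this sketch.
-- Requires the fn_Garble function above to exist in the dbo schema.
DECLARE @Person TABLE (
    PersonId int PRIMARY KEY,
    Surname  varchar(255)
);

INSERT INTO @Person (PersonId, Surname)
VALUES (1, 'Smith'),
       (2, 'Collins');

-- Garble the string column in place.
UPDATE @Person
SET    Surname = dbo.fn_Garble(Surname);

SELECT PersonId, Surname
FROM   @Person
ORDER  BY PersonId;   -- 'Smith' comes back as 'Smaph'
```

Bear in mind that this is a fixed substitution, so it disguises values rather than securing them: two identical inputs always garble to the same output, and someone who knows the rules can often guess the original.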

I had to do this for my client's retail sales data. For names I went to the census and downloaded all the first and last names, ran them through a loop to join every first name to every last name, added a sex code and loaded the result into a table in all upper case. I then had a table with about 400 million unique names. I used upper case because our current data was not in upper case, so I could more easily tell which data had been scrubbed.
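For anyone wanting to build a similar name table, here is a small sketch. Note that it uses a set-based CROSS JOIN rather than the loop described above, and every table and column name in it is hypothetical; the real census files would be loaded into the two source tables first.

```sql
-- Hypothetical source lists; in practice these would be loaded from the census files.
DECLARE @FirstName TABLE (FirstName varchar(50), SexCode char(1));
DECLARE @LastName  TABLE (LastName  varchar(50));

INSERT INTO @FirstName (FirstName, SexCode)
VALUES ('Mary', 'F'), ('James', 'M');

INSERT INTO @LastName (LastName)
VALUES ('Smith'), ('Jones'), ('Garcia');

-- Target table of scrub names, stored in upper case so scrubbed rows are easy to spot.
DECLARE @ScrubName TABLE (
    NameId    int PRIMARY KEY,
    FirstName varchar(50),
    LastName  varchar(50),
    SexCode   char(1)
);

-- CROSS JOIN pairs every first name with every last name (2 x 3 = 6 rows here).
INSERT INTO @ScrubName (NameId, FirstName, LastName, SexCode)
SELECT ROW_NUMBER() OVER (ORDER BY f.FirstName, l.LastName),
       UPPER(f.FirstName),
       UPPER(l.LastName),
       f.SexCode
FROM   @FirstName AS f
CROSS  JOIN @LastName AS l;

SELECT NameId, FirstName, LastName, SexCode
FROM   @ScrubName
ORDER  BY NameId;
```

The sequential NameId is there so that each real user can later be mapped to one scrub name by number.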

When I scrubbed my user data I swapped out the names, set everyone's birthday to Jan 1 of the year they were actually born, and replaced any phone numbers with the user's zip code (my data was US only). Email addresses became first initial plus last name @mycompany.co. The postal address gave me the most grief, but I kept the city, state and zip because I believe they are not an issue once the address line is changed. I had a co-worker who had some program that generated garbled letters, and I updated the address line with that.
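The birthday, phone and email rules could be sketched in T-SQL roughly as follows. The @Users table and its columns are assumptions for illustration, and the sketch assumes the names have already been swapped in an earlier step, because an UPDATE reads the values a row had before the statement ran.

```sql
-- Hypothetical sample data; @Users and its columns are assumptions for this sketch.
DECLARE @Users TABLE (
    UserId    int PRIMARY KEY,
    FirstName varchar(50),
    LastName  varchar(50),
    BirthDate datetime,
    Phone     varchar(20),
    Zip       char(5),
    Email     varchar(100)
);

INSERT INTO @Users (UserId, FirstName, LastName, BirthDate, Phone, Zip, Email)
VALUES (1, 'MARY',  'SMITH', '19800615', '555-0101', '90210', 'real.person@realdomain.test'),
       (2, 'JAMES', 'JONES', '19751203', '555-0102', '10001', 'someone.else@realdomain.test');

UPDATE @Users
SET    -- Keep the birth year but move the date to Jan 1 of that year.
       BirthDate = DATEADD(year, DATEDIFF(year, 0, BirthDate), 0),
       -- Overwrite the phone number with the zip code.
       Phone     = Zip,
       -- First initial plus last name at the placeholder domain.
       Email     = LOWER(LEFT(FirstName, 1) + LastName) + '@mycompany.co';

SELECT UserId, FirstName, LastName, BirthDate, Phone, Zip, Email
FROM   @Users
ORDER  BY UserId;
```

Keeping the birth year means age-based reports still behave sensibly in test, which is presumably why the year was preserved.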

Anywhere I had duplicated data but still had a FK to the main user (bad design yes, but not mine) I updated that data too so the name was consistent across the database for user x.
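Keeping the duplicated copies in step can be done with a joined UPDATE once the main user table has been scrubbed. In this sketch @Users and @Orders are hypothetical stand-ins for the main table and one of the tables holding a duplicated name.

```sql
-- Hypothetical sample data; table and column names are assumptions for this sketch.
DECLARE @Users TABLE (
    UserId    int PRIMARY KEY,
    FirstName varchar(50),
    LastName  varchar(50)
);

DECLARE @Orders TABLE (
    OrderId      int PRIMARY KEY,
    UserId       int,
    CustomerName varchar(101)   -- duplicated copy of the user's name
);

-- @Users already holds scrubbed names; @Orders still holds the stale originals.
INSERT INTO @Users (UserId, FirstName, LastName)
VALUES (1, 'MARY', 'SMITH'), (2, 'JAMES', 'JONES');

INSERT INTO @Orders (OrderId, UserId, CustomerName)
VALUES (100, 1, 'Original Name One'),
       (101, 1, 'Original Name One'),
       (102, 2, 'Original Name Two');

-- Follow the FK back to the main user row and copy the scrubbed name across.
UPDATE o
SET    o.CustomerName = u.FirstName + ' ' + u.LastName
FROM   @Orders AS o
JOIN   @Users  AS u ON u.UserId = o.UserId;

SELECT OrderId, UserId, CustomerName
FROM   @Orders
ORDER  BY OrderId;
```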

Overall my data was still very readable, although the addresses did not make any sense. It took me a couple of days to get all this working, but once it was done and a SQL Agent job was created I could scrub the data in as little as 15 minutes.