White list or black list sanitation for international input?

That's why the character class [[:alnum:]] exists; it includes the characters that are considered valid alphanumerics in the currently active locale. Of course, that doesn't work well on a web server in the US when someone in Egypt is providing input through a form - and it doesn't cover punctuation. It also doesn't include spaces, which may or may not matter.

---Edit--- Building on Mark's answer below and using http://www.regular-expressions.info/unicode.html as a reference, one could also use [\p{L}\p{N}] instead of the alnum character class in most common regex implementations to recognize "all" Unicode letters/numbers in every locale known to the regex engine in use. The choice basically comes down to whether the application doing the comparison knows what locale the input comes from. And, of course, whether the input is expected to be letters and numbers or something else (proper names sometimes contain punctuation, for example). :) ---Edit---
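
For example, here is a minimal Java sketch (assuming java.util.regex, where \p{L} and \p{N} are understood by default); other engines use slightly different syntax for the same Unicode properties:

    import java.util.regex.Pattern;

    public class UnicodeAlnumCheck {
        // Any Unicode letter or number, one or more times; spaces and punctuation are rejected.
        private static final Pattern UNICODE_ALNUM = Pattern.compile("^[\\p{L}\\p{N}]+$");

        public static void main(String[] args) {
            System.out.println(UNICODE_ALNUM.matcher("Ahmed123").matches()); // true
            System.out.println(UNICODE_ALNUM.matcher("أحمد").matches());     // true (Arabic letters)
            System.out.println(UNICODE_ALNUM.matcher("名前42").matches());    // true (CJK + ASCII digits)
            System.out.println(UNICODE_ALNUM.matcher("O'Brien").matches());  // false (apostrophe)
        }
    }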

To more directly answer the question - yes, a whitelist is always preferable. It's not always practical, though. Only someone familiar with the specific application can make the call as to what's actually practical.


Assuming you're asking this in the context of Web Development...

You can detect appropriate character sets with simple regex validation. However, you may also be falling victim to security theater: input sanitation is not the answer.

If you are trying to validate for specific locales, and you don't want to accept any others, you can target specific scripts with a regex. For example (a runnable sketch follows the list):

  1. \p{IsHan} for Chinese characters
  2. \p{InArabic} for Arabic
  3. \p{InThai} for Thai
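
The exact property names vary by regex flavor. As a hedged illustration, here is how those checks might look in Java's java.util.regex, which spells Unicode scripts with an Is prefix and Unicode blocks with an In prefix:

    import java.util.regex.Pattern;

    public class ScriptValidation {
        // Java spells Unicode scripts with the "Is" prefix; other engines use \p{Han}, \p{Arabic}, etc.
        private static final Pattern HAN    = Pattern.compile("^\\p{IsHan}+$");
        private static final Pattern ARABIC = Pattern.compile("^\\p{IsArabic}+$");
        private static final Pattern THAI   = Pattern.compile("^\\p{IsThai}+$");

        public static void main(String[] args) {
            System.out.println(HAN.matcher("中文").matches());     // true
            System.out.println(ARABIC.matcher("مرحبا").matches()); // true
            System.out.println(THAI.matcher("สวัสดี").matches());   // true
            System.out.println(HAN.matcher("hello").matches());    // false
        }
    }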

However, I'm with O'Rooney here: you should accept everything (as long as it's validated - length, null, format, and a whitelist of acceptable values) and use prepared statements with output sanitation.


Warnings About Language-based Whitelisting

If you insist on going with a Unicode-range-based whitelist, then please keep in mind that you should still allow [a-zA-Z0-9], even if you're targeting only non-Latin locales. On the Chinese internet, people frequently type with English letters. For example, they may attempt to evade censorship by abbreviating characters (just text on Wikipedia, but still NSFW). Many people also use pinyin and roman numerals.
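
A sketch of such a combined whitelist, again using Java syntax (the pattern itself is illustrative, not a recommendation for any particular field):

    import java.util.regex.Pattern;

    public class ChineseWithLatinWhitelist {
        // Accept Han ideographs plus plain ASCII letters and digits, since Chinese users
        // routinely mix in Latin abbreviations, pinyin, and numbers.
        private static final Pattern HAN_OR_ASCII_ALNUM =
                Pattern.compile("^[\\p{IsHan}a-zA-Z0-9]+$");

        public static void main(String[] args) {
            System.out.println(HAN_OR_ASCII_ALNUM.matcher("我爱NBA").matches()); // true
            System.out.println(HAN_OR_ASCII_ALNUM.matcher("cao35").matches());  // true (pinyin-style)
            System.out.println(HAN_OR_ASCII_ALNUM.matcher("привет").matches()); // false (Cyrillic)
        }
    }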

You can also use Unicode ranges, but when you are working with combined ideograph sets such as CJK (Chinese, Japanese, and Korean; I believe \p{IsHan} covers the CJK ideographs), you will run into many validation issues.

If you want to exclude by language, you will run into trouble when you're expecting Japanese input but get Chinese input instead, or vice versa. The same applies to Korean versus Chinese or Japanese. You will need to find the appropriate Unicode ranges, and note that some languages overlap: Chinese (Hanzi) and Japanese (Kanji) share some characters.
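
To see the overlap in practice, here is a small Java sketch: a word written entirely in kanji still matches the Han script, so a script check alone cannot tell Chinese from Japanese; only the kana scripts are unambiguously Japanese.

    import java.util.regex.Pattern;

    public class CjkOverlap {
        private static final Pattern HAN      = Pattern.compile("^\\p{IsHan}+$");
        private static final Pattern HIRAGANA = Pattern.compile("^\\p{IsHiragana}+$");

        public static void main(String[] args) {
            // "東京" (Tokyo) is Japanese, but the kanji are Han-script characters,
            // so a "Chinese-only" whitelist would accept it.
            System.out.println(HAN.matcher("東京").matches());          // true
            // Hiragana, by contrast, only appears in Japanese text.
            System.out.println(HIRAGANA.matcher("ひらがな").matches()); // true
            System.out.println(HAN.matcher("ひらがな").matches());      // false
        }
    }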

Because you're worried about accepted input, it sounds like you're looking for input sanitation. This is the wrong approach. You should not be "sanitizing" input that goes into a database. Whitelisting (restricting input to a set of acceptable values) is fine.

Sanitizing and validating input are two different things. What's the difference? (A sketch contrasting the two follows the list below.)

  1. Sanitizing input could look like this: stripApostrophesFromString(input);
  2. Input validation could look like this: if (input != null && input.Length == acceptableNumber && regexFormatIsValid(input) && isWithinAcceptableRanges(input)) { } else { }
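
Here is a hedged Java sketch of that contrast; the helper names and the name format are illustrative only, not a real API:

    import java.util.regex.Pattern;

    public class InputHandling {
        private static final Pattern NAME_FORMAT = Pattern.compile("^[\\p{L}\\p{N} '\\-]+$");
        private static final int MAX_LENGTH = 64;

        // Sanitizing: silently rewrites the input (lossy, and easy to get wrong).
        static String stripApostrophes(String input) {
            return input.replace("'", "");
        }

        // Validating: accepts or rejects the input as-is, without altering it.
        static boolean isValidName(String input) {
            return input != null
                    && !input.isEmpty()
                    && input.length() <= MAX_LENGTH
                    && NAME_FORMAT.matcher(input).matches();
        }

        public static void main(String[] args) {
            System.out.println(stripApostrophes("O'Brien")); // "OBrien" - data has been changed
            System.out.println(isValidName("O'Brien"));      // true    - data left intact
        }
    }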

For character-set validation, a variation of the listed regexes could suffice, but will not validate length, format, etc. If you're worried about SQL injection (and you should be), you should be using prepared statements with output sanitation.
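
For the SQL side, a minimal JDBC-style sketch (the connection string, table, and column names are hypothetical):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.SQLException;

    public class SaveName {
        // Hypothetical connection details and schema, for illustration only.
        static void saveName(String name) throws SQLException {
            try (Connection conn = DriverManager.getConnection(
                         "jdbc:postgresql://localhost/app", "app", "secret");
                 PreparedStatement ps = conn.prepareStatement(
                         "INSERT INTO users (display_name) VALUES (?)")) {
                // The driver sends the value as a bound parameter, never as SQL text,
                // so quotes and other "dangerous" characters need no escaping here.
                ps.setString(1, name);
                ps.executeUpdate();
            }
        }
    }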

Output sanitation is essentially converting dangerous characters, such as the angle brackets of script tags, to their equivalent HTML entities. For example, < becomes &lt; and > becomes &gt;.
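
A hand-rolled sketch of that encoding, for illustration only; in a real application you would normally use the encoder provided by your framework or templating engine:

    public class HtmlEncode {
        // Convert the characters that are significant in HTML to their entity equivalents.
        static String encodeForHtml(String input) {
            StringBuilder out = new StringBuilder(input.length());
            for (char c : input.toCharArray()) {
                switch (c) {
                    case '&':  out.append("&amp;");  break;
                    case '<':  out.append("&lt;");   break;
                    case '>':  out.append("&gt;");   break;
                    case '"':  out.append("&quot;"); break;
                    case '\'': out.append("&#39;");  break;
                    default:   out.append(c);
                }
            }
            return out.toString();
        }

        public static void main(String[] args) {
            System.out.println(encodeForHtml("<script>alert('hi')</script>"));
            // &lt;script&gt;alert(&#39;hi&#39;)&lt;/script&gt;
        }
    }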


Our answer is that for a truly international application, for general input such as people's names, you should accept everything and encode it at display time. Admittedly that (to some extent) passes the problem down to the person writing the encoding routine.

However, if an input is a specific thing, such as a vehicle number plate or a business identification code, then you should validate it against the rules for that format, regardless of the application being international. A further caveat is that those rules might still be difficult to define; for example, number plate formats vary by country.
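
As a hedged illustration of format-specific validation, the plate patterns below are invented to show the shape of such a rule, not real national specifications:

    import java.util.Map;
    import java.util.regex.Pattern;

    public class PlateValidation {
        // Hypothetical formats keyed by country code; real rules would come from each country's spec.
        private static final Map<String, Pattern> PLATE_FORMATS = Map.of(
                "NL", Pattern.compile("^[A-Z]{2}-\\d{3}-[A-Z]$"),
                "DE", Pattern.compile("^[A-ZÄÖÜ]{1,3}-[A-Z]{1,2} \\d{1,4}$")
        );

        static boolean isValidPlate(String countryCode, String plate) {
            Pattern format = PLATE_FORMATS.get(countryCode);
            return format != null && format.matcher(plate).matches();
        }
    }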

(Edit) Why I prefer encoding over validation:

At the time of validation, the data could potentially go anywhere: a CSV text file, an SQL query, a web page, a config setting. You don't know, and cannot know, what the risky characters are.

At the time of encoding, by definition you know where the data is going, so you can then definitively encode the risky characters.
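
As a small sketch of that idea, the same value gets encoded differently depending on the destination (both encoders are simplified for illustration):

    public class ContextEncoding {
        // CSV: wrap in quotes and double any embedded quotes (RFC 4180 style, simplified).
        static String encodeForCsv(String value) {
            return "\"" + value.replace("\"", "\"\"") + "\"";
        }

        // HTML: turn markup-significant characters into entities (simplified).
        static String encodeForHtml(String value) {
            return value.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
        }

        public static void main(String[] args) {
            String name = "Anna \"Ace\" <O'Brien>";
            System.out.println(encodeForCsv(name));  // "Anna ""Ace"" <O'Brien>"
            System.out.println(encodeForHtml(name)); // Anna "Ace" &lt;O'Brien&gt;
        }
    }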