How to protect against diacritics such as Zalgo text

Is there even a limit?!

Not intrinsically in Unicode. There is the concept of a 'Stream-Safe' format in UAX #15 that sets a limit of 30 combiners... Unicode strings in general are not guaranteed to be Stream-Safe, but this could certainly be taken as a sign that Unicode doesn't intend to standardise new characters that would require a grapheme cluster longer than that.

30 is still an awful lot. The longest known natural-language grapheme cluster is the Tibetan Hakṣhmalawarayaṁ at 1 base plus 8 combiners, so for now it would be reasonable to normalise to NFD and disallow any sequence of more than 8 combiners in a row.

If you only care about common Western European languages you can probably bring that down to 2. A reasonable compromise sits somewhere between those two limits; a sketch of the check follows.
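Here is a minimal C# sketch of that check, assuming .NET (the class and method names here are mine, not a library API): normalise to FormD, then count consecutive combining marks. The same loop with a limit of 30 approximates the UAX #15 Stream-Safe test mentioned above.

    using System.Globalization;
    using System.Text;

    static class CombinerLimit
    {
        // Reject input containing a run of combining marks longer than
        // maxCombiners after canonical decomposition (NFD). 8 covers the
        // longest known natural-language cluster; 2 suffices for most
        // Western European text; 30 matches the Stream-Safe limit.
        public static bool ExceedsLimit(string input, int maxCombiners = 8)
        {
            string decomposed = input.Normalize(NormalizationForm.FormD);
            int run = 0;
            foreach (char c in decomposed)
            {
                UnicodeCategory cat = CharUnicodeInfo.GetUnicodeCategory(c);
                bool isCombiner = cat == UnicodeCategory.NonSpacingMark
                               || cat == UnicodeCategory.SpacingCombiningMark
                               || cat == UnicodeCategory.EnclosingMark;
                run = isCombiner ? run + 1 : 0;
                if (run > maxCombiners)
                    return true;
            }
            return false;
        }
    }

One caveat: foreach over a .NET string walks UTF-16 code units, so the rare combining marks outside the Basic Multilingual Plane would need extra surrogate-pair handling.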


I think I found a solution using NormalizationForm.FormC instead of NormalizationForm.FormD. According to the MSDN:

[FormC] Indicates that a Unicode string is normalized using full canonical decomposition, followed by the replacement of sequences with their primary composites, if possible.

I take that to mean that it decomposes characters to their base form, then recomposes them based on a set of rules that remain consistent. I gather this is useful for comparison purposes, but in my case it works perfectly. Characters like ü, é, and Ä are decomposed and recomposed accurately, while the bogus sequences fail to recompose, so the base character and its stray combining marks remain separate:

[screenshot of the before/after normalized output omitted]
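A small sketch to illustrate the behaviour described above (the specific combining characters are just arbitrary examples of the kind Zalgo text uses): under FormC a legitimate base-plus-combiner pair fuses back into a single precomposed character, while stacked bogus combiners have no composite to fuse into and stay behind as separate code points, where they can be detected or stripped.

    using System;
    using System.Text;

    class FormCDemo
    {
        static void Main()
        {
            // "u" + COMBINING DIAERESIS (U+0308): FormC fuses this pair
            // into the single precomposed character U+00FC (ü).
            string legit = "u\u0308";

            // "Z" + four arbitrary stacked combiners, Zalgo-style: no
            // precomposed form exists, so FormC leaves them all in place.
            string zalgo = "Z\u0351\u035D\u0318\u0353";

            Console.WriteLine(legit.Normalize(NormalizationForm.FormC).Length); // 1
            Console.WriteLine(zalgo.Normalize(NormalizationForm.FormC).Length); // 5
        }
    }

From there, any characters still categorised as NonSpacingMark after FormC are candidates for rejection or removal, though stripping them blindly would also delete legitimate marks that simply have no precomposed form.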