Canonicalization & Output Encoding

  • What is "output encoding", and can someone provide a concrete example of how a validation routine could make use of it?

Output encoding means that data is encoded appropriately for the context into which it is being placed. For example, say you want to dynamically display a name from an untrusted source: Your name is: <b>Foo Bar</b>. If the name contains HTML characters, you want those to be encoded, so a name of Foo <i> Bar produces <b>Foo &lt;i&gt; Bar</b> instead of <b>Foo <i> Bar</b>.

So, converting < to &lt; is an example of HTML encoding. However, if the context is an HTML attribute, you may have to encode space characters as well, since an attribute may be unquoted; a space can then break out of the attribute and let the input create a new one: <input value=data> is attacked with a value of "data onclick=javascript:alert(1)", yielding <input value=data onclick=javascript:alert(1)/>.
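As a minimal sketch of the two contexts above (the helper names are mine, not from any particular library): element content can use the standard library's escaping, while an unquoted-attribute context needs something far stricter.

```python
import html

def encode_for_html(untrusted: str) -> str:
    # Element-content context: escape &, <, >, and both quote characters.
    return html.escape(untrusted, quote=True)

def encode_for_html_attr(untrusted: str) -> str:
    # Attribute context: encode everything except ASCII letters and digits
    # as a numeric character reference, so even an unquoted attribute
    # cannot be broken out of with a space or an equals sign.
    return "".join(
        c if c.isascii() and c.isalnum() else "&#x{:02X};".format(ord(c))
        for c in untrusted
    )

name = "Foo <i> Bar"
print("Your name is: <b>%s</b>" % encode_for_html(name))
# Your name is: <b>Foo &lt;i&gt; Bar</b>
```

Note that the attribute encoder leaves no literal spaces in its output, so the "data onclick=..." attack above cannot start a second attribute.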

  • What is "double encoding", and why is it an "obfuscation attack"?

When you type certain characters into a URL, they become URL-encoded (usually, though Internet Explorer does not always do this):

  1. Not encoded parameter: test<script>alert(1)</script>
  2. URL-encoded parameter: test%3Cscript%3Ealert%281%29%3C%2fscript%3E
  3. Double-encoded parameter: test%253Cscript%253Ealert%25281%2529%253C%252fscript%253E

Depending on how input parameters are handled, a double-encoded payload may pass through some filters/validators and end up breaking the context where it is echoed (thus leading to XSS).
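The three forms in the list above can be reproduced with the standard library; this sketch also shows why a filter that decodes only once is fooled:

```python
from urllib.parse import quote, unquote

payload = "test<script>alert(1)</script>"
once = quote(payload, safe="")    # URL-encoded
twice = quote(once, safe="")      # double-encoded

print(once)   # test%3Cscript%3Ealert%281%29%3C%2Fscript%3E
print(twice)  # test%253Cscript%253Ealert%25281%2529%253C%252Fscript%253E

# A filter that decodes only once sees no "<" and lets the value through...
seen_by_filter = unquote(twice)
assert "<" not in seen_by_filter
# ...but a later component that decodes again reconstructs the payload.
assert unquote(seen_by_filter) == payload
```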

  • What is "canonicalization" and why does it prevent against double encoding?

Canonicalization is the act of writing something in its simplest form; the canonical form of something is the "simplest" way to write it. In this context, to canonicalize means to un-encode the data repeatedly until it no longer changes.

A triple-encoded <-sign goes through the following transformations:

  1. %25253C
  2. %253C
  3. %3C
  4. <
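The transformation above is just a decode-until-fixed-point loop. A minimal sketch (the function name and round limit are my own choices, not a standard API):

```python
from urllib.parse import unquote

def canonicalize(value: str, max_rounds: int = 10) -> str:
    # Repeatedly URL-decode until the value stops changing (a fixed point).
    for _ in range(max_rounds):
        decoded = unquote(value)
        if decoded == value:
            return value
        value = decoded
    # Suspiciously many layers of encoding is itself a red flag.
    raise ValueError("too many encoding layers; likely malicious input")

print(canonicalize("%25253C"))  # <
```

Validation should then run on the canonical form, so a double- or triple-encoded payload cannot sneak past a filter that only decodes once.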

Another example is input written as, e.g., octal escapes, overlong UTF-8 sequences, or esoteric encodings such as UTF-7. Canonicalization converts these into a common base for the sake of disambiguation.


I think the best way to describe canonicalization is to remember that it stems from canon, meaning an authentic piece of writing. What they're talking about is taking untrusted data and formatting it as an unambiguous representation, such that it can never be misrepresented by any software process.

The first step is to take your input and store it somewhere. Your input might be encoded as ASCII, UTF-8, UTF-16, or any number of other encoding schemes. The software must detect this and appropriately convert and store the data in a single format. It is now in a single unambiguous format, and therefore known to be correct when interpreted as such, i.e. it is canon. This allows for absolute certainty when later outputting the data.
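One way to sketch that first step in Python, under the assumption that input arrives as bytes and UTF-8 is the single chosen format: decode strictly (rejecting invalid and overlong sequences) and normalize, so equivalent codepoint sequences end up byte-for-byte identical in storage.

```python
import unicodedata

def canonicalize_text(raw: bytes) -> str:
    # Strict UTF-8 decoding: invalid sequences (including overlong
    # encodings) raise UnicodeDecodeError instead of being guessed at.
    text = raw.decode("utf-8", errors="strict")
    # Normalize to NFC so equivalent sequences (e.g. "e" + combining
    # acute vs. the precomposed "é") become the same stored string.
    return unicodedata.normalize("NFC", text)

# "e" followed by U+0301 normalizes to the single codepoint U+00E9.
print(canonicalize_text("e\u0301".encode("utf-8")) == "\u00e9")  # True
```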

For example, if I insert '; DROP TABLE users; -- into a form, it might cause an SQL injection if the app is poorly written. However, with canonicalization, the data is only data, and cannot possibly be represented as part of an SQL query. In reality, SQL's form of canonicalization is parameterized queries. Furthermore, steps must be taken to convert text encoding to a single known type, so that only valid codepoints are stored. If this is not done, a codepoint may be misinterpreted as a different character.
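A parameterized query keeps that input as pure data; a small sketch with the standard library's sqlite3 (the table is invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT)")

evil = "'; DROP TABLE users; --"
# The ? placeholder binds the value as data; the driver never splices
# it into the SQL text, so it cannot become part of the query syntax.
conn.execute("INSERT INTO users (name) VALUES (?)", (evil,))

rows = conn.execute("SELECT name FROM users").fetchall()
print(rows)  # [("'; DROP TABLE users; --",)]
```

The table survives, and the "payload" comes back out as the literal string it always was.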

A similar example can be given for output into HTML. If the database contains <script>alert('xss!');</script>, then a naive app might just write that to the page directly and introduce a security issue. However, with proper canonicalization in the form of output encoding, we'd get &lt;script&gt;alert('xss!');&lt;/script&gt;, which a browser cannot misinterpret.

Double encoding is a trick used to fool certain parsers. The attacker identifies the encoding you're using, then pre-encodes their data in this format. The parser wrongly assumes the data to be canon, and handles it as such. The result is that the data is mishandled, such that an exploit takes place. It's an obfuscation attack, because the attacker is obfuscating exploit data, such that the encoder doesn't see bad characters.