How are character encodings used to bypass XSS sanitizers?

The problem

Abusing character encodings is a popular trick to get XSS to work even when there are filters in place. There are a number of different situations in which it works, but they all share common prerequisites:

  • The attacker sends a payload in character encoding A.
  • The server doing the filtering or sanitization works in character encoding B.
  • The victim's browser interprets the page as if it were in character encoding A.

Let's look at two examples of how this can happen.

Example #1: No encoding parameter in htmlspecialchars

This is a quite common sight in PHP:

echo htmlspecialchars($_GET["query"], ENT_COMPAT | ENT_HTML401);

The problem here is the default behaviour PHP falls back to when no encoding is specified. From the manual:

If omitted, the default value of the encoding varies depending on the PHP version in use. In PHP 5.6 and later, the default_charset configuration option is used as the default value. PHP 5.4 and 5.5 will use UTF-8 as the default. Earlier versions of PHP use ISO-8859-1.

So what encoding PHP uses depends on your version and configuration. Great. So now all that stands between you and the abyss is someone making an innocent change to php.ini, or perhaps something as simple as a server upgrade or reinstall. I too like to live dangerously... but not that dangerously.
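
If you want to see what your installation would actually fall back to, here is a quick diagnostic (a sketch, assuming PHP 5.6 or later, where default_charset is the fallback):

<?php
    // What htmlspecialchars() will use when no encoding argument is given
    // (on PHP 5.6+). This only inspects the configuration; it is not a fix.
    var_dump(ini_get('default_charset')); // e.g. string(5) "UTF-8"
?>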

Note that this example has nothing to do with the browser. Modern or old, it doesn't matter, because it's the server and not the browser that's the problem here.

The solution, of course, is to specify the correct encoding and to make sure that the same encoding is specified in the HTTP Content-Type header of the response:

echo htmlspecialchars($_GET["query"], ENT_COMPAT | ENT_HTML401, "UTF-8");
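
Put together, that means something like this (a minimal sketch; query is just the parameter from the example above):

<?php
    // Tell the browser explicitly what encoding the response uses...
    header('Content-Type: text/html; charset=UTF-8');
    // ...and make the sanitizer work in that very same encoding.
    echo htmlspecialchars($_GET["query"], ENT_COMPAT | ENT_HTML401, "UTF-8");
?>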

Example #2: Browser heuristics biting you

This is a problem if your server does not specify what encoding it is using in the response (or if it only does so in a meta tag that is too far down the page for the browser to care about). If you do not tell the browser what encoding to use, it will have to guess. Unfortunately, not all browsers are good at that:

If certain strings of user input -- say, +ADw-script+AD4-alert(1)+ADw-/script+AD4- -- are echoed back early enough in the HTML page, Internet Explorer may incorrectly guess that the page is encoded in UTF-7. Suddenly, the otherwise harmless user input becomes active HTML and will execute.

The payload in the quote is <script>alert(1)</script> encoded in UTF-7. A sanitizer working in UTF-8 would see nothing dangerous in that payload and let it through, but a browser that is tricked into interpreting the page as UTF-7 would still run it.
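
You can verify the decoding yourself with PHP's mbstring extension (a quick sketch, assuming mbstring is installed):

<?php
    // Prints: <script>alert(1)</script>
    echo mb_convert_encoding('+ADw-script+AD4-alert(1)+ADw-/script+AD4-',
                             'UTF-8', 'UTF-7');
?>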

My understanding is that it is mostly old versions of IE where this is a problem. But I am not sure, so I would be happy to see another answer that clarifies this.

EDIT: See Xavier59's answer for a situation where it works on modern browsers.

The solution

What you need to do on the server is simple in theory. You need to make sure that the following is always true:

  • The character encoding of the response is correctly set in the HTTP headers.
  • The XSS filter is working in the same encoding as specified above.

In practice, it is surprisingly easy to get that wrong.


This comes as an addition to Anders' answer (which is great, btw).

My understanding is that it is mostly old versions of IE where this is a problem. But I do not have a source for that, and I am not sure, so I would be happy to see another answer that clarifies this.

Yes, this affects modern browsers.


Let's take the following sanitization:

<?php
    // Declare the encoding, then strip anything that looks like an opening tag.
    header('Content-Type: text/html;charset=utf-8');
    echo preg_replace('/<\w+/', '', $_GET['name']).", can you p0wn it ?";
?>

This might not seem vulnerable because:

  • < followed by one or more word characters is removed, so an attacker cannot open a new tag.
  • The Content-Type header is correctly set to utf-8.

Now, imagine that we send %00%3C%00: the regex will not match, because < (%3C) is not followed by a word character (as defined by \w) but by %00 (a null byte). Interpreted as UTF-8, the reflected input will not execute anything, but if we can find a way to get it interpreted as UTF-16 ...
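
To see why the filter is blind to this, here is a minimal standalone test (my own sketch, not part of the original POC):

<?php
    // "<script>" encoded as UTF-16BE: every ASCII character is preceded
    // by a null byte.
    $payload = "\x00<\x00s\x00c\x00r\x00i\x00p\x00t\x00>";

    // The filter from above finds nothing to remove: each "<" is followed
    // by "\x00", which \w does not match.
    $filtered = preg_replace('/<\w+/', '', $payload);
    var_dump($filtered === $payload); // bool(true)
?>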

Here is what we can read from the W3C:

If you have a UTF-8 byte-order mark (BOM) at the start of your file then recent browser versions other than Internet Explorer 10 or 11 will use that to determine that the encoding of your page is UTF-8. It has a higher precedence than any other declaration, including the HTTP header.

You could skip the meta encoding declaration if you have a BOM, but we recommend that you keep it, since it helps people looking at the source code to ascertain what the encoding of the page is.

The BOM character is the Unicode character U+FEFF (the different BOM encodings are best described on Wikipedia). Because our input is reflected at the very beginning of the document, we can prepend a UTF-16 BOM to switch the charset to UTF-16 and get our code to execute.
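
For reference, here is one way to produce the payload below (my own reconstruction, assuming mbstring is available; rawurlencode escapes a few more reserved characters than the hand-written version, which makes no difference to the attack):

<?php
    $html = '<script>alert("P0wned");</script>';
    // UTF-16BE byte-order mark, then the markup re-encoded as UTF-16BE.
    $utf16 = "\xFE\xFF" . mb_convert_encoding($html, 'UTF-16BE', 'UTF-8');
    // URL-encode it so it can be sent as the ?name= parameter.
    echo rawurlencode($utf16);
?>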

Complete payload:

%FE%FF%00%3C%00s%00c%00r%00i%00p%00t%00%3E%00a%00l%00e%00r%00t%00(%00%22%00P%000%00w%00n%00e%00d%00%22%00)%00;%00%3C%00/%00s%00c%00r%00i%00p%00t%00%3E

Here is a POC I made. Most XSS auditors will not fall for it, but Firefox will, since its auditor is disabled by default (tested on Firefox Nightly 60.0a1, the latest version as of today).

However, htmlspecialchars and htmlentities will not fall for it. Nonetheless, this shows that there are always tricky edge cases around the corner!
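
That is easy to check (a sketch, reusing the UTF-16BE payload from the test above):

<?php
    $payload = "\x00<\x00s\x00c\x00r\x00i\x00p\x00t\x00>";
    // The 0x3C byte is escaped to &lt; no matter what surrounds it,
    // so no literal "<" survives in the output.
    var_dump(strpos(htmlspecialchars($payload, ENT_COMPAT, 'UTF-8'), '<'));
    // bool(false)
?>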


Other attacks on encodings include character mapping, which is also still relevant today.

Tags: php, encoding, xss