Best way to handle security and avoid XSS with user-entered URLs

If you think URLs can't contain code, think again!

https://owasp.org/www-community/xss-filter-evasion-cheatsheet

Read that, and weep.

Here's how we do it on Stack Overflow:

// Requires: using System.Text.RegularExpressions;

/// <summary>
/// Returns a "safe" URL, stripping anything outside the normal URL charset.
/// </summary>
public static string SanitizeUrl(string url)
{
    return Regex.Replace(url, @"[^-A-Za-z0-9+&@#/%?=~_|!:,.;\(\)]", "");
}
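
For PHP code paths, a rough equivalent of the same character-whitelist idea (this preg_replace port is mine, not Stack Overflow's code) would be:

// Hypothetical PHP port of the C# whitelist above.
function sanitize_url($url) {
    return preg_replace('/[^-A-Za-z0-9+&@#\/%?=~_|!:,.;()]/', '', $url);
}

Keep in mind that a character whitelist alone will not stop a javascript:alert(1) link, since every character in it is on the allowed list; the scheme checks discussed below are still needed.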

The process of rendering a link "safe" should go through three or four steps (a sketch in code follows the list):

  • Unescape/re-encode the string you've been given (RSnake has documented a number of tricks at http://ha.ckers.org/xss.html that use escaping and UTF encodings).
  • Clean the link up: regexes are a good start, but be sure to truncate the string or throw it away if it contains a " (or whatever character you use to close attributes in your output). If the links are only references to other information, you can also force the protocol at the end of this process: if the portion before the first colon is not 'http' or 'https', prepend 'http://'. That lets you build a usable link from the incomplete input a user would type into a browser, and it gives you a last shot at tripping up whatever mischief someone has tried to sneak in.
  • Check that the result is a well formed URL (protocol://host.domain[:port][/path][/[file]][?queryField=queryValue][#anchor]).
  • Possibly check the result against a site blacklist or try to fetch it through some sort of malware checker.
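
Under those assumptions, a minimal sketch of that pipeline in PHP (the function name and checks are mine; treat it as a starting point, not production code):

// Hypothetical sketch of the steps above.
function make_link_safe($url) {
    // Step 1: decode percent-escapes so nothing hides behind encoding
    // (a single pass; repeat or re-encode as your context requires).
    $url = rawurldecode($url);

    // Step 2: clean up - throw the string away if it contains a quote
    // (or whatever closes attributes in your output), then force the
    // protocol: if it does not start with http(s), prepend "http://".
    if (strpos($url, '"') !== false || strpos($url, "'") !== false) {
        return null;
    }
    if (!preg_match('#^https?:#i', $url)) {
        $url = 'http://' . $url;
    }

    // Step 3: check that the result is a well-formed URL.
    if (filter_var($url, FILTER_VALIDATE_URL) === false) {
        return null;
    }

    // Step 4 (optional): a blacklist or malware-checker call goes here.
    return $url;
}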

If security is a priority I would hope that the users would forgive a bit of paranoia in this process, even if it does end up throwing away some safe links.


Use a library, such as OWASP ESAPI:

  • PHP - http://code.google.com/p/owasp-esapi-php/
  • Java - http://code.google.com/p/owasp-esapi-java/
  • .NET - http://code.google.com/p/owasp-esapi-dotnet/
  • Python - http://code.google.com/p/owasp-esapi-python/

Read the following:

  • https://www.golemtechnologies.com/articles/prevent-xss#how-to-prevent-cross-site-scripting
  • https://www.owasp.org/
  • http://www.secbytes.com/blog/?p=253

For example:

$url = "http://stackoverflow.com"; // e.g., $_GET["user-homepage"];
$esapi = new ESAPI( "/etc/php5/esapi/ESAPI.xml" ); // Modified copy of ESAPI.xml
$sanitizer = ESAPI::getSanitizer();
$sanitized_url = $sanitizer->getSanitizedURL( "user-homepage", $url );

Another option is a built-in function, such as PHP's filter_var:

$url = "http://stackoverflow.com"; // e.g., $_GET["user-homepage"];
$sanitized_url = filter_var($url, FILTER_SANITIZE_URL);

Note that FILTER_SANITIZE_URL only strips characters that are illegal in a URL; it does not restrict the scheme, so a javascript: link passes straight through. Using the OWASP ESAPI Sanitizer is probably the best option.
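
If you want to stay with built-ins, you can pair it with FILTER_VALIDATE_URL and an explicit scheme whitelist; a sketch (the helper name is mine):

// Hypothetical helper: FILTER_SANITIZE_URL alone neither validates
// the URL nor restricts the scheme, so do both explicitly.
function validate_user_url($url, $allowed = array('http', 'https')) {
    $url = filter_var($url, FILTER_SANITIZE_URL);
    if (filter_var($url, FILTER_VALIDATE_URL) === false) {
        return null;
    }
    $scheme = strtolower(parse_url($url, PHP_URL_SCHEME));
    return in_array($scheme, $allowed, true) ? $url : null;
}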

Still another example is the code from WordPress:

  • http://core.trac.wordpress.org/browser/tags/3.5.1/wp-includes/formatting.php#L2561
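
That code is WordPress's esc_url(), which enforces a protocol whitelist and encodes the result for output. Inside WordPress you would call it directly (esc_url_raw() is the variant for storage); $user_homepage here is a stand-in for your own input:

// esc_url() for display; esc_url_raw() when saving to the database.
echo '<a href="' . esc_url($user_homepage) . '">Homepage</a>';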

Additionally, since there is no way of knowing where the URL points (i.e., it might be a well-formed URL, but its content could be mischievous), Google has a Safe Browsing API you can call:

  • https://developers.google.com/safe-browsing/lookup_guide
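
A sketch of the lookup using PHP and cURL (this targets the current v4 threatMatches:find endpoint; verify the request shape against the docs, and supply your own API key):

// Hypothetical Safe Browsing check: returns true if Google flags the URL.
function is_flagged_by_safe_browsing($url) {
    $body = json_encode(array(
        'client' => array('clientId' => 'my-app', 'clientVersion' => '1.0'),
        'threatInfo' => array(
            'threatTypes' => array('MALWARE', 'SOCIAL_ENGINEERING'),
            'platformTypes' => array('ANY_PLATFORM'),
            'threatEntryTypes' => array('URL'),
            'threatEntries' => array(array('url' => $url)),
        ),
    ));
    $ch = curl_init('https://safebrowsing.googleapis.com/v4/threatMatches:find?key=YOUR_API_KEY');
    curl_setopt($ch, CURLOPT_POST, true);
    curl_setopt($ch, CURLOPT_POSTFIELDS, $body);
    curl_setopt($ch, CURLOPT_HTTPHEADER, array('Content-Type: application/json'));
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    $response = json_decode(curl_exec($ch), true);
    curl_close($ch);
    // Any entry in "matches" means the URL is on a threat list.
    return !empty($response['matches']);
}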

Rolling your own regex for sanitization is problematic for several reasons:

  • Unless you are Jon Skeet, the code will have errors.
  • Existing APIs have many hours of review and testing behind them.
  • Existing URL-validation APIs consider internationalization.
  • Existing APIs will be kept up-to-date with emerging standards.

Other issues to consider:

  • What schemes do you permit (are file:/// and telnet:// acceptable)?
  • What restrictions do you want to place on the content of the URL (are malware URLs acceptable)?

Just HTMLEncode the links when you output them. Make sure you don't allow javascript: links. (It's best to have a whitelist of protocols that are accepted, e.g., http, https, and mailto.)
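
In PHP, that output step might look like this (the whitelist below is a sketch; adjust the accepted schemes to your needs):

// Enforce the protocol whitelist, then HTML-encode on output.
$allowed_schemes = array('http', 'https', 'mailto');
$scheme = strtolower((string) parse_url($url, PHP_URL_SCHEME));
if (in_array($scheme, $allowed_schemes, true)) {
    echo '<a href="' . htmlspecialchars($url, ENT_QUOTES, 'UTF-8') . '">link</a>';
}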