Strict HTML Validation and Filtering in PHP

I have used HTML Purifier with success and haven't had any XSS or other unwanted input slip through. I also run the sanitized HTML through the Tidy extension to make sure it validates as well.


I've tested all exploits I know on HTML Purifier and it did very well. It filters not only HTML, but also CSS and URLs.

Once you narrow elements and attributes down to innocent ones, the remaining pitfalls are in attribute content: javascript: pseudo-URLs (IE allows tab characters in the protocol name, so "java	script:" with a literal tab in the middle still works) and CSS properties that trigger JS.

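Parsing of URLs can be tricky, e.g. these are valid: http://spoof.com:[email protected] (everything before the @ is user info, so the real host is the part after it) or //evil.com (a protocol-relative URL). Internationalized domains (IDN) can be written in two ways, Unicode and punycode, so the same host can appear in two different spellings.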
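
To make the URL pitfalls above concrete, here is a minimal sketch of an allowlist check using only built-in PHP functions. The function name is_safe_url() and the sample URLs are my own illustration, not part of HTML Purifier, and the sketch deliberately rejects relative URLs and does not attempt IDN normalization.

```php
<?php
// Hypothetical helper (not from any library): accept only absolute
// http/https URLs that carry no user-info part.
function is_safe_url(string $url): bool
{
    // Strip ASCII control characters and whitespace, which some browsers
    // ignore inside the scheme (e.g. "java\tscript:").
    $clean = preg_replace('/[\x00-\x20\x7F]+/', '', $url);
    if ($clean === null) {
        return false;
    }

    // Reject protocol-relative URLs such as //evil.com.
    if (strncmp($clean, '//', 2) === 0) {
        return false;
    }

    // Require an explicit scheme and compare it against the allowlist.
    if (!preg_match('/^([a-z][a-z0-9+.\-]*):/i', $clean, $m)) {
        return false; // no scheme: relative URL, rejected in this sketch
    }
    if (!in_array(strtolower($m[1]), ['http', 'https'], true)) {
        return false;
    }

    // Reject malformed URLs, URLs without a host, and URLs with user info
    // (user:pass@host), where the visible "host" may be a decoy.
    $parts = parse_url($clean);
    if ($parts === false || empty($parts['host']) || isset($parts['user'])) {
        return false;
    }

    return true;
}
```

For example, is_safe_url("java\tscript:alert(1)") and is_safe_url('//evil.com') are both rejected, while is_safe_url('http://example.com/') passes. In real use you would probably rely on HTML Purifier's own URI filtering rather than a hand-written check like this.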

Go with HTML Purifier: it has most of these worked out. If you just want to fix broken HTML, then use HTML Tidy (it's available as a PHP extension).
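
A rough sketch of that combination might look like the following. It assumes HTML Purifier was installed with Composer (so vendor/autoload.php exists) and that the tidy extension may or may not be loaded; the wrapper name purify_and_tidy() and the particular allowlist are only illustrative.

```php
<?php
// Assumption: HTML Purifier is installed via Composer (ezyang/htmlpurifier).
if (is_file(__DIR__ . '/vendor/autoload.php')) {
    require_once __DIR__ . '/vendor/autoload.php';
}

// Hypothetical wrapper: purify first, then let Tidy normalise the markup.
function purify_and_tidy(string $dirty): string
{
    $config = HTMLPurifier_Config::createDefault();
    // Illustrative allowlist of elements and attributes; adjust as needed.
    $config->set('HTML.Allowed', 'p,br,b,i,em,strong,ul,ol,li,a[href]');
    // Only allow http and https links.
    $config->set('URI.AllowedSchemes', ['http' => true, 'https' => true]);
    // Turn off the on-disk definition cache so no writable directory is needed.
    $config->set('Cache.DefinitionImpl', null);

    $purifier = new HTMLPurifier($config);
    $clean = $purifier->purify($dirty);

    // Tidy is optional here: it only repairs and normalises markup.
    if (function_exists('tidy_repair_string')) {
        $tidied = tidy_repair_string(
            $clean,
            ['show-body-only' => true, 'output-xhtml' => true],
            'utf8'
        );
        if ($tidied !== false) {
            $clean = $tidied;
        }
    }

    return $clean;
}
```

Purifying before tidying matters: Tidy only fixes structure and does not remove dangerous content, so it should never be the only step.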


User-submitted HTML isn't always valid, or indeed complete. Browsers will interpret a wide range of invalid HTML and you should make sure you can catch it.

Also be aware of the valid-looking:

<img src="http://www.mysite.com/logout" />

and

<a href="javascript:alert('xss hole');">click</a>
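
Both examples are well-formed, which is why validation alone does not catch them. The first is presumably dangerous because merely loading the image sends a request to the site's own logout URL; the second runs script when clicked. As a sketch only, the following audits href and src attributes with PHP's DOM extension. The function name find_suspicious_urls(), the finding labels, and the "own host" rule are my assumptions, not an established API.

```php
<?php
// Hypothetical audit helper: report href/src attributes that either use a
// scheme other than http/https or embed a resource from the site itself.
function find_suspicious_urls(string $html, string $ownHost): array
{
    $doc = new DOMDocument();
    // User-submitted markup is often invalid; collect parser warnings quietly.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $findings = [];
    $xpath = new DOMXPath($doc);
    foreach ($xpath->query('//@href | //@src') as $attr) {
        // Drop whitespace and control characters before inspecting the URL.
        $value = (string) preg_replace('/[\x00-\x20\x7F]+/', '', $attr->value);
        $scheme = strtolower((string) parse_url($value, PHP_URL_SCHEME));
        $host = strtolower((string) parse_url($value, PHP_URL_HOST));
        $tag = $attr->ownerElement->nodeName;

        if ($scheme !== 'http' && $scheme !== 'https') {
            // Also flags relative URLs, which have no scheme at all.
            $findings[] = [$tag, $attr->name, 'disallowed scheme'];
        } elseif ($attr->name === 'src' && $host === strtolower($ownHost)) {
            // Loading this resource triggers a GET against our own site.
            $findings[] = [$tag, $attr->name, 'same-site request'];
        }
    }

    return $findings;
}
```

Run against the two snippets above with 'www.mysite.com' as the own host, this should flag the img src as a same-site request and the a href as a disallowed scheme. The more robust fix for the first case is on the server side: actions such as logout should not be triggered by a plain GET request.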