Sanitising HTML with C# and Tidy

(*See article end for updates)

A while back I was given the task of sanitising HTML entered by users in our Discussions area at Huddle. After searching around, I initially couldn't find many easy-to-implement tools to help me.

I first came across Jeff Atwood's solution for cleaning HTML. It's a very fast solution, although not watertight. Regular expressions are great for string matching when you know what to expect, but attacks vary so unpredictably that some of them can slip through.

The best way to apply Jeff's method would be to HtmlEncode the whole string, then decode back anything "safe" that is matched in the encoded string. This would be a true whitelist solution (as opposed to blacklisting). In fact, it would be better still to clean up the HTML first.
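To make the encode-then-decode idea concrete, here is a minimal sketch. It is not Jeff's code or the code used at Huddle: the class name, the list of tags and the choice of `WebUtility.HtmlEncode` (`HttpUtility.HtmlEncode` would be the ASP.NET equivalent) are all assumptions made for illustration, and it only restores tags that carry no attributes.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper illustrating "encode everything, decode only safe matches".
public static class EncodeThenDecodeSketch
{
    // Assumed whitelist: a handful of attribute-free tags, matched in their
    // encoded form (&lt;b&gt;, &lt;/b&gt;, ...).
    private static readonly Regex SafeEncodedTag = new Regex(
        @"&lt;(/?)(b|i|em|strong|p|ul|ol|li)&gt;",
        RegexOptions.IgnoreCase | RegexOptions.Compiled);

    public static string Sanitise(string html)
    {
        // Step 1: encode the whole string so every piece of markup becomes inert text.
        string encoded = WebUtility.HtmlEncode(html ?? string.Empty);

        // Step 2: decode back only the tags that match the whitelist exactly.
        return SafeEncodedTag.Replace(
            encoded,
            m => "<" + m.Groups[1].Value + m.Groups[2].Value.ToLowerInvariant() + ">");
    }
}
```

With this sketch, `EncodeThenDecodeSketch.Sanitise("<b>hi</b><script>alert(1)</script>")` returns `<b>hi</b>&lt;script&gt;alert(1)&lt;/script&gt;`: the bold tags come back, while the script tags stay encoded because nothing in the whitelist matches them.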

Next I came across the HTML Agility Pack and Tidy.Net (a .Net port of Tidy). Both are HTML parsers and, in Tidy's case, a cleaner too.

I decided to go with the more mature product, Tidy. There were two choices: use the original version of Tidy (through some sort of COM wrapper) or opt for Tidy.Net (a less mature .Net port). For speed and ease, I went for Tidy.Net.

Before writing any application code I went to, and for each attack I wrote a unit test. My current solution can be found on Refactor My Code. It's by no means perfect, but at the moment is working well enough. I do plan to re-visit it though and finish up a few things. In fact once finished I'll post the zip of the code on here.

Here's a quick walk-through of it:

  1. The HTML string is passed to Tidy.Net and cleaned. Tidy.Net takes care of making sure the HTML is well formed, which makes pattern matching with a regular expression much easier and more accurate.
  2. The output from Tidy.Net has to be cleaned slightly, as it adds html, head, body, etc. tags to the output (which we don't need).
  3. Once cleaned I apply pattern matching. One thing I will be changing is to HtmlEncode the whole string, and then Decode any safe matches back. This would ensure proper whitelisting.
  4. Using a static dictionary collection I can specify the allowed tags and attributes. There's potential to improve this, possibly using XML and loading it via IoC (as a singleton).
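Steps 2 and 4 of the walk-through could be sketched roughly as follows. This is an illustration, not the code posted on Refactor My Code: the class, the method names and the whitelist entries are hypothetical, the Tidy.Net call itself is left out, and the input is assumed to be well formed with double-quoted attribute values.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical sketch: strip the document wrapper, then filter by a static whitelist.
public static class WhitelistSketch
{
    private static HashSet<string> Attrs(params string[] names)
    {
        return new HashSet<string>(names, StringComparer.OrdinalIgnoreCase);
    }

    // Assumed whitelist: tag name -> attributes allowed on that tag.
    private static readonly Dictionary<string, HashSet<string>> Allowed =
        new Dictionary<string, HashSet<string>>(StringComparer.OrdinalIgnoreCase)
        {
            { "p", Attrs() },
            { "b", Attrs() },
            { "i", Attrs() },
            { "a", Attrs("href", "title") },
        };

    private static readonly Regex Body = new Regex(
        @"<body[^>]*>(.*?)</body>", RegexOptions.IgnoreCase | RegexOptions.Singleline);

    private static readonly Regex Tag = new Regex(
        @"<(/?)([a-z][a-z0-9]*)\b([^>]*)>", RegexOptions.IgnoreCase);

    private static readonly Regex Attribute = new Regex(
        @"\s+([a-z][a-z0-9-]*)\s*=\s*""[^""]*""", RegexOptions.IgnoreCase);

    // Step 2: keep only what sits between <body> and </body>.
    public static string ExtractBody(string cleanedDocument)
    {
        Match m = Body.Match(cleanedDocument);
        return m.Success ? m.Groups[1].Value.Trim() : cleanedDocument;
    }

    // Step 4: drop tags and attributes that are not in the whitelist.
    public static string Filter(string fragment)
    {
        return Tag.Replace(fragment, tag =>
        {
            HashSet<string> allowedAttrs;
            if (!Allowed.TryGetValue(tag.Groups[2].Value, out allowedAttrs))
                return string.Empty; // unknown tag: remove the tag, its inner text stays

            // Rebuild the tag from whitelisted, double-quoted attributes only.
            string attrs = string.Empty;
            foreach (Match a in Attribute.Matches(tag.Groups[3].Value))
                if (allowedAttrs.Contains(a.Groups[1].Value))
                    attrs += a.Value;

            return "<" + tag.Groups[1].Value + tag.Groups[2].Value.ToLowerInvariant() + attrs + ">";
        });
    }
}
```

Note that this sketch only checks attribute names. An allowed attribute can still carry a dangerous value (a `javascript:` URL in `href`, for example), so values would need their own validation, and text inside a removed tag such as `script` is left behind as plain text.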

I'll try to update the code and get it published in the next week or two.

UPDATE - 19/01/2009

Here's the latest code as it stands. It still needs work; this update just addresses a few odd bugs I found in Tidy.Net (spaces being removed, line breaks being added, etc.). I haven't looked at whitelisting yet as I've not had time (just got back from snowboarding in the Alps), plus I've still got loads of other stuff on.

UPDATE - 06/06/2009

In the latest code link above, I've now included some of the unit tests, around 15 I think. There are about another 6, but I need to clean them up.