Get plain text from HTML in .NET

Personally, I found a combination of regex and HttpUtility to be the best and shortest solution.

Return HttpUtility.HtmlDecode(
                Regex.Replace(HtmlString, "<(.|\n)*?>", "")
                )

This removes all the tags, and then decodes any remaining HTML entities such as &lt; or &gt;
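For C# readers, here is a minimal sketch of the same strip-then-decode idea. It swaps in WebUtility.HtmlDecode from System.Net for HttpUtility so that no reference to System.Web is needed; that substitution, and the names HtmlText and ToPlainText, are my own assumptions rather than part of the answer above.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class HtmlText
{
    // Same pattern as above: lazily matches anything between '<' and '>',
    // including tags that span several lines.
    private static readonly Regex TagPattern = new Regex(@"<(.|\n)*?>");

    // Hypothetical helper: strip the tags first, then decode the entities.
    public static string ToPlainText(string html)
    {
        string withoutTags = TagPattern.Replace(html, "");
        return WebUtility.HtmlDecode(withoutTags);
    }
}

public static class Program
{
    public static void Main()
    {
        // Prints: Fish & chips <fresh>
        Console.WriteLine(HtmlText.ToPlainText("<p>Fish &amp; chips &lt;fresh&gt;</p>"));
    }
}
```

The order matters: decoding first would turn &lt;fresh&gt; into something that looks like a tag, and the regex would then strip it.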

~~There is no built-in solution in the framework.~~

If you need to parse HTML, I have had good experiences with a library called HTML Agility Pack.
It parses an HTML file and exposes it as a DOM, similar to the XML classes.
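As a rough sketch of what that could look like, assuming the HtmlAgilityPack NuGet package is referenced: load the markup, read InnerText from the document node, and decode entities with HtmlEntity.DeEntitize. The class and method names AgilityPackText and ToPlainText are hypothetical, and as far as I know InnerText also includes the contents of script and style elements, so you may want to remove those nodes first.

```csharp
using System;
using HtmlAgilityPack; // third-party package, not part of the framework

public static class AgilityPackText
{
    // Hypothetical helper: parse the markup and return its text content.
    public static string ToPlainText(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // InnerText concatenates the text nodes but leaves entities
        // such as &amp; encoded, so decode them afterwards.
        string text = doc.DocumentNode.InnerText;
        return HtmlEntity.DeEntitize(text);
    }
}

public static class Program
{
    public static void Main()
    {
        Console.WriteLine(AgilityPackText.ToPlainText("<p>what? &amp; <b>who?</b></p>"));
    }
}
```

Because the parser builds a tree instead of pattern-matching, it should cope better than a regex with unclosed tags or a '>' inside an attribute value.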

You can use MSHTML, which can be pretty forgiving of malformed markup:

// Requires a reference to the Microsoft.mshtml assembly and "using mshtml;"
HTMLDocument htmldoc = new HTMLDocument();
IHTMLDocument2 htmldoc2 = (IHTMLDocument2)htmldoc;
htmldoc2.write(new object[] { "<p>Plateau <i>of<i> <b>Leng</b><hr /><b erp=\"arp\">2 sugars please</b> <xxx>what? &amp; who?" });

string txt = htmldoc2.body.outerText;

Plateau of Leng 2 sugars please what? & who?

There's no built-in utility as far as I know, but depending on your requirements you could use regular expressions to strip out all of the tags:

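using System.Text.RegularExpressions;

string htmlString = @"<p>I'm HTML!</p>";
// Regex.Replace returns a new string, so keep the result.
string plainText = Regex.Replace(htmlString, @"<(.|\n)*?>", "");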


