HTML to Text with C#

Posted by sajithm on Feb 17th, 2011

I was working on a requirement to extract the text from an HTML page. Here is a regex-based solution that handles most cases.

// Requires: using System; using System.Text.RegularExpressions;
private static String HtmlToText(String html)
{
    const RegexOptions options = RegexOptions.Compiled | RegexOptions.ExplicitCapture | RegexOptions.IgnoreCase | RegexOptions.Singleline;
    // Capture everything between the opening <body ...> tag and the last </body>.
    Regex regFind = new Regex(@"<body\b[^>]*>(?<bodyText>.*)</body>", options);
    // Strip script and style elements together with their content, then any remaining tag.
    Regex regReplace = new Regex(@"<script\b[^>]*/>|<script\b[^>]*>.*?</script>|<style\b[^>]*/>|<style\b[^>]*>.*?</style>|<.*?>", options);
    // Fall back to the whole input when there is no <body> element.
    Match match = regFind.Match(html);
    String body = match.Success ? match.Groups["bodyText"].Value : html;
    String text = regReplace.Replace(body, String.Empty);
    return System.Net.WebUtility.HtmlDecode(text);
}
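
To see what the helper returns for a small page, here is a minimal console sketch. The class name HtmlToTextDemo and the sample markup are made up for illustration, and the helper is restated in condensed form so the sample compiles on its own. The sketch assumes that input without a body element should be processed as a whole.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class HtmlToTextDemo
{
    // Condensed copy of the helper so this sample is self-contained.
    // Assumption of this sketch: input without a <body> element is processed as a whole.
    public static string HtmlToText(string html)
    {
        const RegexOptions options = RegexOptions.IgnoreCase | RegexOptions.Singleline;

        // Keep only what sits between <body ...> and the last </body>, if present.
        Match match = Regex.Match(html, @"<body\b[^>]*>(?<bodyText>.*)</body>", options);
        string body = match.Success ? match.Groups["bodyText"].Value : html;

        // Drop script/style elements with their content, then every remaining tag.
        string text = Regex.Replace(
            body,
            @"<script\b[^>]*>.*?</script>|<style\b[^>]*>.*?</style>|<.*?>",
            string.Empty,
            options);

        // Turn entities such as &amp; back into plain characters.
        return WebUtility.HtmlDecode(text);
    }

    public static void Main()
    {
        string page =
            "<html><head><title>Demo</title></head><body class=\"x\">" +
            "<h1>Hello</h1> <script>var a = 1;</script><p>Tom &amp; Jerry</p>" +
            "</body></html>";

        Console.WriteLine(HtmlToText(page));              // prints: Hello Tom & Jerry
        Console.WriteLine(HtmlToText("<p>a &lt; b</p>")); // prints: a < b
    }
}
```

Note that tags are removed without inserting any separator, so adjacent block elements run together: "<p>a</p><p>b</p>" comes out as "ab". If that matters, one option is to replace tags with a space and collapse repeated whitespace afterwards.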