HTML to Text with C#

Posted by sajithm on Feb 17th, 2011

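I was working on a requirement to extract the text from an HTML page. Here is a solution that handles most cases: it takes the contents of the body element, strips script and style blocks along with every remaining tag, and decodes the HTML entities.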

// Requires: using System; using System.Text.RegularExpressions;
private static String HtmlToText(String html)
{
    RegexOptions options = RegexOptions.Compiled | RegexOptions.ExplicitCapture | RegexOptions.IgnoreCase | RegexOptions.Singleline;
    // Capture everything between <body ...> and the last </body>.
    Regex regFind = new Regex("<body(?<bodyAttrib>.*?)>(?<bodyText>.*)</body>", options);
    // Remove script and style elements together with their content, then any remaining tag.
    // A "/>" only ends a script or style element when it closes the opening tag itself.
    Regex regReplace = new Regex("<(script|style)[^>]*/>|<script.*?</script>|<style.*?</style>|<.*?>", options);
    Match match = regFind.Match(html);
    // Fall back to the whole input when there is no <body> element.
    String body = match.Success ? match.Groups["bodyText"].Value : html;
    return System.Net.WebUtility.HtmlDecode(regReplace.Replace(body, String.Empty));
}
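
To see what the method returns in practice, here is a small console sketch. The class name HtmlToTextDemo and the sample markup are made up for illustration. The method is repeated in condensed form, with a fallback for input that has no body element, so that the sample compiles on its own. Treat it as a quick check under those assumptions, not a complete HTML parser.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class HtmlToTextDemo
{
    // Condensed variant of the method from the post (hypothetical class, for illustration only).
    public static string HtmlToText(string html)
    {
        RegexOptions options = RegexOptions.ExplicitCapture | RegexOptions.IgnoreCase | RegexOptions.Singleline;

        // Take the body contents; fall back to the whole input if there is no <body> element.
        Match match = Regex.Match(html, "<body(?<bodyAttrib>.*?)>(?<bodyText>.*)</body>", options);
        string body = match.Success ? match.Groups["bodyText"].Value : html;

        // Drop script/style elements with their content, then every remaining tag.
        string text = Regex.Replace(
            body,
            "<(script|style)[^>]*/>|<script.*?</script>|<style.*?</style>|<.*?>",
            string.Empty,
            options);

        // Turn entities such as &amp; back into plain characters.
        return WebUtility.HtmlDecode(text);
    }

    public static void Main()
    {
        // Assumed sample input: a head with a style block, a body with a script and one paragraph.
        string html =
            "<html><head><title>Demo</title><style>p { color: red; }</style></head>" +
            "<body class=\"main\"><script>var x = \"<br/>\";</script><p>Fish &amp; chips</p></body></html>";

        Console.WriteLine(HtmlToText(html));                 // Fish & chips
        Console.WriteLine(HtmlToText("Plain &lt;text&gt;")); // Plain <text>
    }
}
```

Note that tags are removed without inserting anything in their place, so text from adjacent elements such as two consecutive paragraphs runs together. If that matters, one option is to replace block-level tags with a newline before stripping the rest.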
