Html To Text with C#
Posted by sajithm on Feb 17th, 2011
I was working on a requirement to extract the text from an HTML page. Here is a regex-based solution that handles most cases.
// Requires: using System; using System.Text.RegularExpressions;
private static String HtmlToText(String html)
{
    const RegexOptions options = RegexOptions.Compiled | RegexOptions.ExplicitCapture | RegexOptions.IgnoreCase | RegexOptions.IgnorePatternWhitespace | RegexOptions.Singleline;
    // Capture the content between <body ...> and </body>.
    Regex regFind = new Regex("(?<headText>.*?)<body(?<bodyAttrib>.*?)>(?<bodyText>.*)</body>.*", options);
    // Match script and style elements first, then any remaining tag.
    Regex regReplace = new Regex("((<script.*?(/>|</script>))|(<style.*?(/>|</style>))|<.*?>)", options);
    // Fall back to the full input when there is no <body> element, instead of throwing on Captures[0].
    Match match = regFind.Match(html);
    String body = match.Success ? match.Groups["bodyText"].Value : html;
    return System.Net.WebUtility.HtmlDecode(regReplace.Replace(body, String.Empty));
}
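To see what the method does with a small page, here is a minimal, self-contained sketch. The class name `HtmlToTextDemo` and the sample inputs are made up for illustration, and the fallback to the raw input when no `<body>` element is found is an assumption added here so that fragments do not throw.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical demo class; only the two regex patterns come from the post.
public static class HtmlToTextDemo
{
    private const RegexOptions Options =
        RegexOptions.Compiled | RegexOptions.ExplicitCapture |
        RegexOptions.IgnoreCase | RegexOptions.Singleline;

    // Captures everything between <body ...> and the last </body>.
    private static readonly Regex RegFind = new Regex(
        "(?<headText>.*?)<body(?<bodyAttrib>.*?)>(?<bodyText>.*)</body>.*", Options);

    // Matches script and style elements first, then any remaining tag.
    private static readonly Regex RegReplace = new Regex(
        "((<script.*?(/>|</script>))|(<style.*?(/>|</style>))|<.*?>)", Options);

    public static string HtmlToText(string html)
    {
        Match match = RegFind.Match(html);
        // Assumption for this sketch: use the whole input if there is no <body>.
        string body = match.Success ? match.Groups["bodyText"].Value : html;
        string text = RegReplace.Replace(body, string.Empty);
        // Turn entities such as &amp; back into plain characters.
        return WebUtility.HtmlDecode(text);
    }

    public static void Main()
    {
        string page = "<html><head><title>Demo</title></head>" +
                      "<body class=\"main\"><p>Fish &amp; chips</p>" +
                      "<script>var x = 1;</script></body></html>";
        Console.WriteLine(HtmlToText(page));               // prints "Fish & chips"
        Console.WriteLine(HtmlToText("<b>bold</b> text")); // prints "bold text"
    }
}
```

Tags are removed without inserting whitespace, so text from adjacent elements runs together. The stripping is also approximate: a script whose source contains `/>` is cut short at that point by the pattern, leaving the rest of the script in the output. For input like that, an HTML parser is likely the safer choice.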