Get HTML code from website in C#

Better you can use the Webclient class to simplify your task:

using System.Net;

using (WebClient client = new WebClient())
{
    string htmlCode = client.DownloadString("http://somesite.com/default.html");
}

Getting HTML code from a website. You can use code like this.

string urlAddress = "http://google.com";

HttpWebRequest request = (HttpWebRequest)WebRequest.Create(urlAddress);
HttpWebResponse response = (HttpWebResponse)request.GetResponse();

if (response.StatusCode == HttpStatusCode.OK)
{
  Stream receiveStream = response.GetResponseStream();
  StreamReader readStream = null;

  if (String.IsNullOrWhiteSpace(response.CharacterSet))
     readStream = new StreamReader(receiveStream);
  else
     readStream = new StreamReader(receiveStream, Encoding.GetEncoding(response.CharacterSet));

  string data = readStream.ReadToEnd();

  response.Close();
  readStream.Close();
}

This will give you the returned HTML code from the website. But find text via LINQ is not that easy. Perhaps it is better to use regular expression but that does not play well with HTML code


Best thing to use is HTMLAgilityPack. You can also look into using Fizzler or CSQuery depending on your needs for selecting the elements from the retrieved page. Using LINQ or Regukar Expressions is just to error prone, especially when the HTML can be malformed, missing closing tags, have nested child elements etc.

You need to stream the page into an HtmlDocument object and then select your required element.

// Call the page and get the generated HTML
var doc = new HtmlAgilityPack.HtmlDocument();
HtmlAgilityPack.HtmlNode.ElementsFlags["br"] = HtmlAgilityPack.HtmlElementFlag.Empty;
doc.OptionWriteEmptyNodes = true;

try
{
    var webRequest = HttpWebRequest.Create(pageUrl);
    Stream stream = webRequest.GetResponse().GetResponseStream();
    doc.Load(stream);
    stream.Close();
}
catch (System.UriFormatException uex)
{
    Log.Fatal("There was an error in the format of the url: " + itemUrl, uex);
    throw;
}
catch (System.Net.WebException wex)
{
    Log.Fatal("There was an error connecting to the url: " + itemUrl, wex);
    throw;
}

//get the div by id and then get the inner text 
string testDivSelector = "//div[@id='test']";
var divString = doc.DocumentNode.SelectSingleNode(testDivSelector).InnerHtml.ToString();

[EDIT] Actually, scrap that. The simplest method is to use FizzlerEx, an updated jQuery/CSS3-selectors implementation of the original Fizzler project.

Code sample directly from their site:

using HtmlAgilityPack;
using Fizzler.Systems.HtmlAgilityPack;

//get the page
var web = new HtmlWeb();
var document = web.Load("http://example.com/page.html");
var page = document.DocumentNode;

//loop through all div tags with item css class
foreach(var item in page.QuerySelectorAll("div.item"))
{
    var title = item.QuerySelector("h3:not(.share)").InnerText;
    var date = DateTime.Parse(item.QuerySelector("span:eq(2)").InnerText);
    var description = item.QuerySelector("span:has(b)").InnerHtml;
}

I don't think it can get any simpler than that.

Tags:

C#

Html

Linq