7.31.2011

C# Station: Fetching Web Pages with HTTP

How To: Fetching Web Pages with HTTP

Introduction

HTTP is the primary transport mechanism for communicating with resources on the World Wide Web. A developer will often want to fetch web pages for reasons such as search engine page caching, extracting information from a particular page, or implementing browser-like capabilities. To help with this task, the .NET Framework includes classes that make this easy.

Getting an HTTP Page

The HTTP classes in the .NET Framework are HttpWebRequest and HttpWebResponse. The process involves specifying the web page to get with an HttpWebRequest object, performing the actual request, and receiving the page through an HttpWebResponse object. After that, you use stream operations to extract the page contents. Listing 1 demonstrates how this process works.

Listing 1: Getting a Web Page: WebFetch.cs
    using System;
    using System.IO;
    using System.Net;
    using System.Text;

    /// <summary>
    /// Fetches a Web Page
    /// </summary>
    class WebFetch
    {
        static void Main(string[] args)
        {
            // used to build entire input
            StringBuilder sb = new StringBuilder();

            // used on each read operation
            byte[] buf = new byte[8192];

            // prepare the web page we will be asking for
            HttpWebRequest request = (HttpWebRequest)
                WebRequest.Create("http://www.mayosoftware.com");

            // execute the request
            HttpWebResponse response = (HttpWebResponse)
                request.GetResponse();

            // we will read data via the response stream
            Stream resStream = response.GetResponseStream();

            string tempString = null;
            int count = 0;

            do
            {
                // fill the buffer with data
                count = resStream.Read(buf, 0, buf.Length);

                // make sure we read some data
                if (count != 0)
                {
                    // translate from bytes to ASCII text
                    tempString = Encoding.ASCII.GetString(buf, 0, count);

                    // continue building the string
                    sb.Append(tempString);
                }
            }
            while (count > 0); // any more data to read?

            // print out page source
            Console.WriteLine(sb.ToString());
        }
    }

The program in Listing 1 requests the main page of a web site and displays the HTML on the console screen. Because the page data is returned as bytes, we set up a byte array, named buf, to hold the results. You'll see how this is used in a couple of paragraphs.

The first step in getting a web page is to instantiate an HttpWebRequest object. This occurs when invoking the static Create() method of the WebRequest class. The parameter to the Create() method is a string representing the URL of the web page you want. A similar overload of the Create() method accepts a single Uri instance. The Create() method returns a WebRequest type, so we need to cast it to an HttpWebRequest type before assigning it to the request variable. Here's the line creating the request object:

    // prepare the web page we will be asking for
    HttpWebRequest request = (HttpWebRequest)
        WebRequest.Create("http://www.mayosoftware.com");
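The Uri overload mentioned above can be sketched as follows. This is a minimal illustration, not part of Listing 1: the class name CreateRequestSketch, the method BuildRequest, and the Method and Timeout values are assumptions chosen for the example. Creating the request does not contact the server; that only happens when GetResponse() is called.

```csharp
using System;
using System.Net;

class CreateRequestSketch
{
    // Hypothetical helper: builds a request from a Uri instead of a string.
    // Nothing is sent yet; the network is only used by GetResponse().
    public static HttpWebRequest BuildRequest(Uri address)
    {
        // Create() picks the concrete request type from the URI scheme,
        // so an "http" address yields an HttpWebRequest.
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(address);

        // Optional settings (illustrative values, not from the article).
        request.Method = "GET";
        request.Timeout = 10000; // milliseconds
        return request;
    }

    static void Main()
    {
        HttpWebRequest request =
            BuildRequest(new Uri("http://www.mayosoftware.com"));

        // Show which host the request is aimed at.
        Console.WriteLine(request.RequestUri.Host);
    }
}
```

Using a Uri can be convenient when the address has already been parsed or validated elsewhere in the program.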

Once you have the request object, use it to get a response object. The response object is created by calling the GetResponse() method of the request object that was just created. The GetResponse() method does not accept parameters and returns a WebResponse object, which must be cast to an HttpWebResponse type before we can assign it to the response variable. The following line shows how to obtain the HttpWebResponse object.

    // execute the request
    HttpWebResponse response = (HttpWebResponse)
        request.GetResponse();
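Listing 1 assumes the request succeeds. As a hedged sketch of one way to guard this step, the example below wraps GetResponse() in a try/catch for WebException, which is thrown for network failures and for HTTP error statuses, and uses a using block so the response is closed afterwards. The class name ResponseSketch and the method TryGetStatus are hypothetical names introduced for illustration.

```csharp
using System;
using System.Net;

class ResponseSketch
{
    // Hypothetical helper: returns a status line such as "200 OK" on
    // success, or null when the request fails.
    public static string TryGetStatus(string url)
    {
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
        try
        {
            // "using" closes the response (and frees its connection)
            // as soon as we are done with it.
            using (HttpWebResponse response =
                (HttpWebResponse)request.GetResponse())
            {
                return (int)response.StatusCode + " "
                    + response.StatusDescription;
            }
        }
        catch (WebException ex)
        {
            // Raised for connection problems and for error statuses
            // such as 404 or 500.
            Console.Error.WriteLine("Request failed: " + ex.Status);
            return null;
        }
    }

    static void Main()
    {
        string status = TryGetStatus("http://www.mayosoftware.com");
        Console.WriteLine(status ?? "(no response)");
    }
}
```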

The response object is used to obtain a Stream object; the Stream class is a member of the System.IO namespace. The GetResponseStream() method of the response instance is invoked to obtain this stream as follows:

    // we will read data via the response stream
    Stream resStream = response.GetResponseStream();

Remember the byte array we instantiated at the beginning of the algorithm? Now we'll pass it to the Read() method of the stream we just got to retrieve the web page data. The Read() method accepts three arguments: the first is the byte array to populate, the second is the position in the array at which to begin populating, and the third is the maximum number of bytes to read. The method returns the actual number of bytes that were read, which is 0 once the end of the stream has been reached. Here's how the web page data is read:

    // fill the buffer with data
    count = resStream.Read(buf, 0, buf.Length);
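To see how the return value of Read() behaves without touching the network, here is a small sketch that uses a MemoryStream as a stand-in for the response stream. The class name ReadSketch, the method CountBytes, and the 20000-byte test data are assumptions made for the example.

```csharp
using System;
using System.IO;

class ReadSketch
{
    // Hypothetical helper: counts the bytes in a stream by calling
    // Read() until it returns 0.
    public static int CountBytes(Stream input, int bufferSize)
    {
        byte[] buf = new byte[bufferSize];
        int total = 0;
        int count;

        // Read() returns how many bytes it placed in buf;
        // a return value of 0 means the end of the stream.
        while ((count = input.Read(buf, 0, buf.Length)) > 0)
        {
            total += count;
        }
        return total;
    }

    static void Main()
    {
        // A MemoryStream stands in for the response stream.
        using (MemoryStream stream = new MemoryStream(new byte[20000]))
        {
            Console.WriteLine(CountBytes(stream, 8192)); // prints 20000
        }
    }
}
```

With an 8192-byte buffer, the 20000 bytes arrive over several calls, and the final call returns 0 to signal that nothing is left.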

We now have an array of bytes with the web page data in it. However, it is a good idea to transform these bytes into a string. That way we can use all the built-in string manipulation methods available with .NET. I chose to use the static ASCII property of the Encoding class in the System.Text namespace for this task. The encoding object it returns has a GetString() method which accepts three arguments, similar to the Read() method we just discussed. The first parameter is the byte array to read bytes from, for which we pass buf. The second is the position in buf at which to begin reading. The third is the number of bytes in buf to decode. I passed count, which was the number of bytes returned from the Read() method, as the third parameter, which ensures that only the bytes actually read are decoded. Note that ASCII decoding replaces any byte above 127 with a question mark, so this choice suits pages that contain only plain ASCII text. Here's the code that translates bytes in buf to a string and appends the results to a StringBuilder object.

    // translate from bytes to ASCII text
    tempString = Encoding.ASCII.GetString(buf, 0, count);

    // continue building the string
    sb.Append(tempString);
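ASCII is safe to decode one buffer at a time because every character is a single byte. If a page were encoded as UTF-8 instead (an assumption, not something Listing 1 handles), a multi-byte character could be split across two reads. One possible approach, sketched below, is to use a Decoder, which remembers a partial character between calls. The class name DecodeSketch and the method ReadUtf8 are hypothetical.

```csharp
using System;
using System.IO;
using System.Text;

class DecodeSketch
{
    // Hypothetical helper: reads a stream in chunks and decodes it as UTF-8.
    public static string ReadUtf8(Stream input, int bufferSize)
    {
        byte[] buf = new byte[bufferSize];

        // The Decoder keeps any incomplete character until the next call.
        Decoder decoder = Encoding.UTF8.GetDecoder();
        char[] chars = new char[Encoding.UTF8.GetMaxCharCount(bufferSize)];
        StringBuilder sb = new StringBuilder();

        int count;
        while ((count = input.Read(buf, 0, buf.Length)) > 0)
        {
            // Decode only the bytes that were actually read.
            int charCount = decoder.GetChars(buf, 0, count, chars, 0);
            sb.Append(chars, 0, charCount);
        }
        return sb.ToString();
    }

    static void Main()
    {
        // "caf\u00e9" is five bytes in UTF-8; the last character takes two.
        byte[] bytes = Encoding.UTF8.GetBytes("caf\u00e9");
        using (MemoryStream stream = new MemoryStream(bytes))
        {
            // A 2-byte buffer forces the final character to straddle reads.
            string text = ReadUtf8(stream, 2);
            Console.WriteLine(text == "caf\u00e9"); // prints True
        }
    }
}
```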

The buffer size is set at 8192 bytes, which is only large enough to hold a small web page, and a single Read() call may return fewer bytes than the buffer can hold. To get around this, the code that reads the response stream must be wrapped in a loop that keeps reading until there are no more bytes to return. Listing 1 uses a do loop because we have to make at least one read. Recall that every Read() returns a count of the bytes that were actually read. The while condition of the do loop checks the count to make sure something was actually read. Also, notice the if statement that makes sure we don't try to translate bytes when nothing was read. Because we used a loop, we needed to collect the results of each iteration, which is why we append the result of each iteration to a StringBuilder.
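As an alternative to writing the loop by hand, the buffering, looping, and decoding can be delegated to a StreamReader. The following is a minimal sketch rather than a replacement for Listing 1; the class name ReadToEndSketch, the method ReadAllText, and the sample HTML are assumptions, and a MemoryStream again stands in for the response stream.

```csharp
using System;
using System.IO;
using System.Text;

class ReadToEndSketch
{
    // Hypothetical helper: reads an entire stream as text.
    public static string ReadAllText(Stream input, Encoding encoding)
    {
        // StreamReader calls Read() in a loop internally and decodes
        // the bytes as it goes. Disposing the reader closes the stream.
        using (StreamReader reader = new StreamReader(input, encoding))
        {
            return reader.ReadToEnd();
        }
    }

    static void Main()
    {
        byte[] page = Encoding.ASCII.GetBytes(
            "<html><body>Hello</body></html>");

        using (MemoryStream stream = new MemoryStream(page))
        {
            Console.WriteLine(ReadAllText(stream, Encoding.ASCII));
        }
    }
}
```

With a real response, the stream returned by GetResponseStream() would be passed in place of the MemoryStream. The explicit loop in Listing 1 remains useful when you want to process the data as it arrives.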

Summary

The HttpWebRequest and HttpWebResponse classes from the .NET Base Class Library make it easy to request web pages over the internet. The HttpWebRequest object identifies the web page to get and provides a GetResponse() method for obtaining an HttpWebResponse object. With an HttpWebResponse object, we retrieve a stream to read bytes from. Iterating until all the bytes of a web page are read, translating the bytes to strings, and accumulating those strings makes it possible to obtain the entire web page.

Your feedback is very important and I appreciate any constructive contributions you have. Please feel free to contact me for any questions or comments you may have about this article.
