Automatic Web Page Use With .NET
Written by Ian Elliot   
Wednesday, 28 August 2013

The Login Page

Now we have to set up the event handler. It simply checks that the page has loaded and then uses a switch statement to call the method that is going to process the current page:

void DocumentCompleted(
   object sender,
   WebBrowserDocumentCompletedEventArgs e)
{
 if ( webBrowser1.ReadyState!=
    WebBrowserReadyState.Complete) return;

 switch (page)
 {
  case 1:
   page = 2;
   login();
   break;
  case 2:
   page = 3;
   gotoData();
   break;
  case 3:
   page = 4;
   getData();
   break;
 }
}

You can see that this starts in the way described earlier - i.e. it checks the ready state. Next it checks the value of page and calls the appropriate method to handle that page. You can also see that each time page is updated to indicate the next page to be processed. 
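For reference, the wiring this relies on, set up in the first part of the article, amounts to something like the following sketch - the constructor name and start URL are placeholders, and page is the state variable tested in the switch above:

int page = 1;

public Form1()
{
 InitializeComponent();
 webBrowser1.DocumentCompleted +=
                      DocumentCompleted;
 webBrowser1.Navigate("URL of login page");
}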

The first page we have to deal with is the login page and here we have to fill in a form with the user name and password. You need to find the ids of the input fields being used for the user name and password. You can do this quite easily using the debug facilities in your favourite browser. Navigate to the page and inspect the elements where you type in your name and password - they should have ids set. In the case of the page I was working with the ids were txtUserName and pasPassword. 

While all interaction with a web page that is out of your control is inherently fragile, elements that have ids are unlikely to change quickly. The reason is that the ids are used by the server to process the results, so changes on the client side have to go hand in hand with changes on the server side. 

The next problem is how to get at the elements and enter the data we want.

This is where the WebBrowser control really becomes an asset. It lets you get at the document in three different ways: as a string, as a stream or as an HtmlDocument. The HtmlDocument approach allows you to work with elements in the same way as the DOM allows JavaScript to work with the page. In fact, if you already know how to work with the DOM in JavaScript you already know a lot about working with the HtmlDocument object.
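To make the three routes concrete, here is a rough sketch - DocumentText, DocumentStream and Document are the actual WebBrowser properties, but what you do with them is entirely up to you:

// the page as a single HTML string
string html = webBrowser1.DocumentText;

// the page as a stream (needs using System.IO)
Stream stream = webBrowser1.DocumentStream;

// the page as an HtmlDocument - DOM-style access
HtmlDocument doc = webBrowser1.Document;
HtmlElement body = doc.Body;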

For example, to get an element by id you simply use the GetElementById method and to modify the element's text you can use the InnerText property. Putting this together, the login method starts off:

private void login()
{
 HtmlElement name = webBrowser1.Document.
            GetElementById("txtUserName");
 if (name != null)
 {
  name.InnerText = "user name";
 }
 HtmlElement password = webBrowser1.
  Document.GetElementById("pasPassword");
 if (password != null)
 {
  password.InnerText = "password";
 }

 

At this point we have the two authentication fields filled in as if the user had typed them. 

Next we need to click the submit button. 

Again we need the button's id and this can be acquired by examining the page using a debugger. In this case the button's id was discovered to be "login-submit". So now we can find the button very easily, but how do we click it?

The answer is that each HtmlElement object has an InvokeMember method. This allows you to execute any method or property that the element has. You can even pass parameters to a method by supplying them as additional arguments to InvokeMember. In our case all we need to do is invoke the button's click method:

 HtmlElement submit = webBrowser1.Document.
           GetElementById("login-submit");
 if (submit != null)
 {
  submit.InvokeMember("click");
 }
}
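Incidentally, if you ever do need to pass parameters, InvokeMember has an overload that accepts a params object[]. As a hedged illustration - the element id here is made up - you could scroll an element into view like this:

 HtmlElement item = webBrowser1.Document.
            GetElementById("some-element");
 if (item != null)
 {
  // true = align the element with the top of the view
  item.InvokeMember("scrollIntoView", true);
 }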

 

Getting The Data

Now the next page starts loading in response to the click method submitting the form. With luck the website will correctly validate the username and password and we end up on a new page with a menu of options. All we have to do next is navigate to the page with the data on it:

private void gotoData()
{
 webBrowser1.Navigate("URL of data page");
}

As long as this all goes well the next page to be loaded has the words  "used xx.xxGB" where xx.xx is a value that changes and the one we want to extract from the page. In this case there are no obvious elements with unique ids that can be used to locate the text. The simplest approach to getting the data is to do what a human would do and read the text, i.e. create a regular expression.

If we assume that "used" is enough of a target to locate the data we can get away with a simple regular expression:

@"used\s+\d+.\d+"

which specifies the word "used" followed by one or more white space characters, one or more digits, a point and one or more digits. This is simple and vulnerable to going wrong if the web page changes; however, making it more specific to the current layout would only make it more susceptible to small changes. When it comes to general web page "scraping" the more general the better is the rule.

One small refinement is that we actually want only the numeric value and not the "used" etc. This can be achieved using a match grouping:

private void getData()
{
 string text = webBrowser1.DocumentText;
 Regex ex1 = new Regex(@"used\s+(\d+\.\d+)");
 string used = ex1.Match(text).
                   Groups[1].ToString();

 

In this case we are returning the first matched group from the regular expression result, which is a string like 13.34. Notice that this time we make use of the WebBrowser's DocumentText property to get the HTML page as a string. 

You also need to add:

using System.Text.RegularExpressions;

From here the application can start to process the data and work out things like a prediction for how much data will be used in the period based on the per day usage.

For the example all we will do is show the amount of data used in the TextBox:

 textBox1.Text = used;
}

Where Next?

If you want a real-time indicator of the amount of data used then, instead of using the button, we need to run the data collection when the app first loads and then use a timer to run it every hour or so. Notice that the rule is that you should never load a website with more accesses than a human could manage, and in this case the data is only updated on the server every hour or so.
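A minimal sketch of the timer version, assuming the collection is kicked off by a method called startCollection - a name invented here - which resets page and navigates back to the login page:

Timer timer = new Timer();

private void Form1_Load(object sender, EventArgs e)
{
 timer.Interval = 60 * 60 * 1000;   // one hour
 timer.Tick += (s, args) => startCollection();
 timer.Start();
 startCollection();
}

private void startCollection()
{
 page = 1;
 webBrowser1.Navigate("URL of login page");
}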

In addition to using a timer it is also a good idea to dispose of the WebBrowser object when it isn't being used. Call 

 webBrowser1.Dispose();

at the end of the data collection.

Another problem for you to solve is making the whole application robust against network problems. At the moment the program works fine as long as the login works first time and all of the pages download without an error. To cope with a login failure, for example, you need to detect the failure and retry a small number of times. This can be done by setting the page number back to 1, but you also need to count the number of tries and this suggests extending the state variable into a state object. Getting an application to work correctly no matter what the network throws at it is very difficult. At all times you need to make sure that your app gives up gracefully rather than bombarding the website with requests.
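One way of extending the state variable into a state object, as suggested, is a small class along these lines - the names and the retry limit are just assumptions:

class ScrapeState
{
 public int Page = 1;    // page we expect next
 public int Tries = 0;   // login attempts so far
 public const int MaxTries = 3;
}

// called when a failed login is detected
private void retryLogin(ScrapeState state)
{
 if (++state.Tries >= ScrapeState.MaxTries)
 {
  // give up - never bombard the site with requests
  return;
 }
 state.Page = 1;
 webBrowser1.Navigate("URL of login page");
}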

If you are building an app that will show you your data usage then most of the tricky work is in doing date arithmetic to work out a daily rate and then using that to estimate monthly use. 
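The arithmetic itself only takes a few lines - a sketch, here assuming the billing period runs from the first of the month, which you would replace with your own period start:

// needs using System.Globalization
double usedGB = double.Parse(used,
               CultureInfo.InvariantCulture);

DateTime now = DateTime.Now;
DateTime periodStart = new DateTime(
                   now.Year, now.Month, 1);
double daysSoFar = (now - periodStart).TotalDays;
if (daysSoFar < 1) daysSoFar = 1;  // avoid divide by zero on day one
int daysInPeriod = DateTime.DaysInMonth(
                   now.Year, now.Month);

double perDay = usedGB / daysSoFar;
double estimate = perDay * daysInPeriod;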

 

The complete listing is on the next page.

Related Articles:
Hit Highlighting with dtSearch
Automating applications using messages

 
