Webbots, Spiders, and Screen Scrapers
Author: Michael Schrenk

It's important to note the subtitle, "A Guide to Developing Internet Agents with PHP/CURL".

In this age of HTML5 and the semantic web it is surprising that we still have to consider such low-level ways of interacting with web pages as bots, spiders and scrapers - but we do, and this book promises to explain how it all works. In principle the web should be both user- and program-friendly. After all, a web page is a document marked up precisely with HTML; it should be possible for a program to read the page, parse it, make sense of it and interact with it. In practice this is more difficult for a range of reasons - incorrect HTML, JavaScript/Ajax and all manner of plug-ins - and it is hard to pretend to be a browser. This book is one of the few that attempts to gather together the range of techniques you need to write programs that work with web sites intended to be used by humans. Perhaps the most important point is that the book uses PHP and the CURL library for all of its examples, and it doesn't spend very much time explaining either the language or the library.
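To give you an idea of the level the book works at, this is the sort of raw PHP/CURL code involved in simply fetching a page. It is my own minimal sketch rather than one of the book's listings, and the URL is just a placeholder:

<?php
// Minimal page download with PHP/cURL (illustrative only; the URL is a placeholder).
$ch = curl_init('http://www.example.com/');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);      // return the page as a string rather than printing it
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);      // follow any redirects
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0');  // some sites refuse requests that send no user agent
$html = curl_exec($ch);
if ($html === false) {
    echo 'Download failed: ' . curl_error($ch) . "\n";
}
curl_close($ch);
?>

CURLOPT_RETURNTRANSFER is the key option here - without it cURL simply echoes the page instead of handing it to your script as data you can parse.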
Part I of the book is on fundamental concepts and techniques. Chapters 1 and 2 are fairly missable in that they discuss ethics and what you might do with a webbot. Chapter 3 is where things really start, with a look at the simple task of downloading a web page. This is beginners' stuff, but if you don't know how to download a web page as a file in PHP then it is essential knowledge, and it is also where we get a basic introduction to the CURL library. Chapter 4 deals with parsing HTML using simple PHP functions and Chapter 5 extends this to regular expressions. Chapter 6 is on the very important topic of automating web form submission. This is a very common task for any bot interacting with a web site and the difficulty ranges from trivial to considerable. After explaining the basics of form submission, the chapter goes on to explain how things can go wrong (a bare-bones sketch of the sort of POST request involved appears below). Chapter 7 deals with handling the large amounts of data a bot can end up gathering - essentially a discussion of creating file formats and using a database. You might well know most of this already as it is a fairly general programming topic.

Part II of the book is just a collection of projects - a price-monitoring bot, an image-capturing bot, a link verifier, a search-ranking bot, an aggregator, an FTP bot, an email reader, an email sender and a bot that converts a website into a PHP function. All of the projects are well described and all are fairly simple. If you followed the discussion in the first part of the book and have been programming for a while then you should be capable of creating any of them.

Part III is about advanced topics. It opens with Chapter 17 on spiders, i.e. bots that follow the URLs in the web pages that they examine. Spiders are difficult mostly because you have to decide which links to follow and impose some sort of cut-off to stop the process going on forever. The discussion in this chapter is enough to get you started but no more; in practice you are going to have to do a lot more work to get something practical, and you are going to have to link your spider to a database - a topic not covered. Chapter 18 is about procurement and sniper bots, i.e. the sort of thing you might use to automate bidding in an auction. The chapter does little more than explain the theory. When you think about it, however, the task is a difficult one, the more so because of the difficulty of testing anything you create - can you imagine losing that eBay item just because your bot made a mistake? Chapters 19 and 20 are an overview of cryptography and authentication - a bit too short and basic, but again enough to get you started. Chapter 21 is on cookies and again is more a sketch of the difficulties you are going to encounter. Chapter 22 moves on to scheduling bots, though it is just a look at the Windows scheduler; if you are using another operating system you will have to consult its documentation. Chapter 23 introduces a new idea - why not automate things through the browser itself using a macro? Unfortunately the best macro tool we have is iMacros, and it isn't particularly impressive; the chapter explains its weaknesses but doesn't do much to help put them right. Chapter 24 goes deeper into using iMacros, probably not deep enough to solve all of its problems, but it is still useful to learn how to autorun a macro. The final chapter of the section is on deployment and scaling - a very difficult subject, and the chapter really only gets you started.
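To make the form submission discussion concrete, this is roughly what an automated POST looks like in raw PHP/CURL. Again this is my own sketch rather than the book's code, and the URL and field names are invented for illustration:

<?php
// Illustrative automated form submission with PHP/cURL; URL and field names are invented.
$fields = http_build_query(array(
    'username' => 'myname',
    'password' => 'secret',
));

$ch = curl_init('http://www.example.com/login.php');
curl_setopt($ch, CURLOPT_POST, true);                  // submit as a POST, as a browser would for this form
curl_setopt($ch, CURLOPT_POSTFIELDS, $fields);         // the URL-encoded form data
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_COOKIEJAR, 'cookies.txt');    // keep any session cookie for later requests
curl_setopt($ch, CURLOPT_COOKIEFILE, 'cookies.txt');
$response = curl_exec($ch);
curl_close($ch);
?>

In practice the POST itself is the easy part - the difficulty usually lies in reproducing the hidden fields, session tokens and cookies that the site expects to accompany it.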
Part IV is titled "Larger Considerations", which is a bit mysterious. It turns out to be a discussion of how to hide your bot and how to harden it. Chapter 26 is about being stealthy - don't run at the same time every day, for example. Chapter 27 is about using a proxy to hide your identity, and Chapter 28 is about fault tolerance. The final three chapters cover making your own site bot-friendly, how to make life difficult for bots, and a little about the law.

Overall this is an easy-to-read book that describes many of the basic ideas quite well. The problem, however, is that a production-level bot or spider is a difficult program to create. It requires not just simple programming techniques but an overall system architecture that lets multiple copies run with huge resources and lots of bandwidth. You probably need either to invest in a data center of your own or to rent time on something like Amazon's AWS, in which case you have a whole system to put together and your bot needs to interact with a shared database - another notch up from what is described in this book. However, many programmers just want to implement a simple bot to help fill in a form or automate some repetitive task. In that case, as long as you are happy with PHP/CURL, the book covers its subject quite well, and if you fit into this category you will find it useful.
Last Updated ( Thursday, 26 April 2012 )