Friday, May 11, 2007

Book Review: Webbots, Spiders and Screen Scrapers by Michael Schrenk

by T. Michael Testi (Blogcritics.org, PhotographyToday, ATAEE)

According to Michael Schrenk, the internet is bigger and better than what a mere browser allows. Webbots, Spiders and Screen Scrapers was written to show you how to take advantage of the vast resources available on the internet. When you are relegated to the world of a browser, you are limited in what is available to you. The goal of Webbots, Spiders and Screen Scrapers is to open up the Web and enhance your online experience.

What is the problem with browsers? A browser is a manual tool that downloads and renders websites; you still need to decide whether a site is relevant to you. Your browser cannot think. It cannot anticipate your actions and won't notify you when something important happens. To accomplish that, you need the automation and intelligence available only in a webbot, also known as a web robot.

Webbots, Spiders and Screen Scrapers contains 28 chapters that break down into four sections. I will focus on the four sections, highlighting chapters as needed. To work with this book you will need a fundamental understanding of HTML and of how the internet works. Be aware that this book is not going to teach you how to program, or how things like TCP/IP, the protocol of the internet, work. Pretty much any Pentium-class computer running Windows, Linux, or the Mac operating system will do. You will also want to get PHP, cURL, and MySQL, all of which are free on the internet. Again, this book will not teach you how to use these products; rather, it uses them to teach you how to create webbots, spiders, and screen scrapers.

Part one, "Fundamental Concepts and Techniques," introduces the concepts of web automation and explores the elementary techniques that will allow you to harness the resources of the web. It begins by explaining why it is fun to write webbots and how writing webbots can be a rewarding career. It tells you where to get ideas for webbot projects and talks about existing as well as potential webbots. You will learn how to download web pages, parse those pages, automatically submit forms, and manage large amounts of data. All of these topics set you up for the rest of the book.
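
To give a taste of the kind of thing part one covers, here is a minimal sketch of my own (not code from the book) that downloads a page with PHP and cURL and scrapes out its title. The URL is a placeholder:

    <?php
    // A minimal page-downloading webbot using PHP's cURL extension.
    // The URL is a placeholder; the "parsing" is deliberately naive.
    $ch = curl_init('http://www.example.com/');
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the page as a string
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects
    $page = curl_exec($ch);
    curl_close($ch);

    // Scrape the page title as a trivial screen-scraping example.
    if ($page !== false && preg_match('/<title>(.*?)<\/title>/is', $page, $m)) {
        echo "Title: " . trim($m[1]) . "\n";
    }
    ?>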

Part two, "Projects," expands on the concepts that you learned in part one. According to the author, with further development, any of these projects could be transformed into a marketable product. The projects include a price-monitoring webbot that collects and analyzes online prices from any number of websites, an image-capturing webbot that downloads all of the images from a website, and email-reading and email-sending webbots. All in all, eleven projects are included in part two.
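
To give a flavor of these projects, here is a rough sketch (again mine, not the author's) of the core of a price-monitoring webbot. The URL and the price pattern are invented for illustration:

    <?php
    // Sketch of the heart of a price-monitoring webbot. The URL and the
    // HTML around the price are hypothetical; a real target site would
    // need its own parsing landmarks.
    $ch = curl_init('http://store.example.com/widget');
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    $page = curl_exec($ch);
    curl_close($ch);

    // Look for a dollar amount such as $19.99 in the downloaded page.
    if ($page !== false && preg_match('/\$([0-9]+\.[0-9]{2})/', $page, $m)) {
        $price = (float) $m[1];
        echo "Current price: \$$price\n";
        // A real webbot would store this in MySQL and compare prices over time.
    }
    ?>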

Part three, "Advanced Technical Considerations," explores the finer technical aspects of webbot and spider development. Here the author shares some hard-learned lessons while teaching you how to write specialized webbots and spiders. You will learn about spiders: webbots that find and follow links, both within a single website and out across the web, searching out specific information. You will learn how to create snipers: webbots that automatically purchase items from places like auction sites when a specific set of criteria has been met. You will also find out how to deal with cryptography, authentication, and scheduling.
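
The core idea behind a spider is surprisingly small: download a page, harvest its links, and queue the ones you haven't seen. A toy sketch, with a placeholder seed URL and an artificial ten-page cap, might look like this:

    <?php
    // Toy spider: fetch pages, harvest their links, and queue unseen ones.
    // The seed URL is a placeholder; a real spider would also normalize
    // relative links, honor robots.txt, and limit itself to a target domain.
    $queue   = array('http://www.example.com/');
    $visited = array();

    while (!empty($queue) && count($visited) < 10) { // bounded for the example
        $url = array_shift($queue);
        if (isset($visited[$url])) {
            continue;
        }
        $visited[$url] = true;

        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        $page = curl_exec($ch);
        curl_close($ch);
        if ($page === false) {
            continue; // skip pages that fail to download
        }
        echo "Spidered: $url\n";

        // Harvest absolute links and add unseen ones to the queue.
        preg_match_all('/href="(http[^"]+)"/i', $page, $m);
        foreach ($m[1] as $link) {
            if (!isset($visited[$link])) {
                $queue[] = $link;
            }
        }
    }
    ?>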

Part four, "Larger Considerations," advises you on being socially responsible with your webbot development. Webbots need to coexist not only with society, but also with the system administrators of the sites that you target. You will learn how to create webbots with stealth: webbots whose traffic looks like normal browser traffic. You will also learn how to write fault-tolerant webbots; because the web changes constantly, your webbots will need to handle those changes. Finally, you will learn how to create webbot-friendly websites, how to secure pages against other spiders to protect sensitive data, and how to keep your own webbots out of trouble.
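
As a simple illustration of the stealth idea (my sketch, not the book's), a single cURL option lets a webbot announce itself with a browser-style user agent. The URL and user-agent string below are just examples:

    <?php
    // Sketch: one basic stealth technique is to send a browser-style
    // User-Agent header so the webbot's requests blend in with normal
    // browser traffic in the server's logs.
    $ch = curl_init('http://www.example.com/');
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_USERAGENT,
        'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.3) Gecko/20070309 Firefox/2.0.0.3');
    $page = curl_exec($ch);
    curl_close($ch);
    ?>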

Webbots, Spiders and Screen Scrapers is a great book for learning about webbots and spiders and how they work on the web. It introduces you to a wide variety of topics via the projects, and you will come away understanding how the different technologies work.

I have one main complaint and one warning. First, the complaint: in Chapter 27, "Killing Spiders," we are told about robots.txt, which is meant to keep spiders from trampling sensitive pages. Then we are told that we really shouldn't rely on it, because compliance with robots.txt is strictly voluntary and the file itself might alert others to where the sensitive material is. I was left confused: if I don't use a robots.txt file, other webbots will trample my site, and if I do use one, I leave myself open to attack.
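
For readers who haven't seen one, a robots.txt file is just a plain-text list of rules served from a site's root. Something like the following (the paths are made up) asks all compliant spiders to stay out of two directories, which is also exactly how it advertises them:

    # Example robots.txt, served from http://www.example.com/robots.txt
    # Well-behaved spiders read this before crawling; compliance is
    # voluntary, which is the weakness the chapter points out.
    User-agent: *
    Disallow: /private/
    Disallow: /admin/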

The warning begins on page four: "You may use any of the scripts in this book for your own personal use, as long as you agree not to redistribute them... and agree not to sell or create derivative products under any circumstances." This directly contradicts what the author says on page 75: "Any of these projects, with further development, could be transformed from a simple webbot concept into a potentially marketable product." I'll leave it to you to determine what he means.

Webbots, Spiders and Screen Scrapers is a great book for the beginner-to-intermediate PHP user and for anyone who wants to get into web agents. It is a good overview of and tutorial on webbots, spiders, and screen scrapers, and it will have you understanding and programming them in no time.
