
Use a user agent in your spider

Some websites won't let your spider scrape their pages unless you send a user agent with the request. You can make such websites believe the request is coming from a browser by setting the User-Agent header. Here is a piece of code that uses the user agent 'Mozilla/5.0' to get the HTML content of a website:

import urllib2

url = "http://www.example.com"  # write your url here
opener = urllib2.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
usock = opener.open(url)
url = usock.geturl()
data = usock.read()
usock.close()
print data

You can use other user agents as well. For example, this is the user agent my Firefox browser uses: "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.1.12) Gecko/20061201 Firefox/2.0.0.12 (Ubuntu-feisty)"

What is your user agent?
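To check which user agent actually reaches the server, a small local test server can echo the header back. The test server below is my own addition, purely for illustration; the try/except import is also an assumption of mine so the snippet runs on newer Pythons, where urllib2 became urllib.request:

```python
import threading

try:  # Python 2
    import urllib2
    from BaseHTTPServer import HTTPServer, BaseHTTPRequestHandler
except ImportError:  # Python 3: urllib2 became urllib.request
    import urllib.request as urllib2
    from http.server import HTTPServer, BaseHTTPRequestHandler

class EchoAgentHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Echo back whatever User-Agent header the client sent.
        agent = self.headers.get('User-Agent', '')
        self.send_response(200)
        self.end_headers()
        self.wfile.write(agent.encode('utf-8'))

    def log_message(self, fmt, *args):
        pass  # keep the example's output quiet

# Bind to port 0 so the OS picks a free port.
server = HTTPServer(('127.0.0.1', 0), EchoAgentHandler)
t = threading.Thread(target=server.serve_forever)
t.daemon = True
t.start()

url = 'http://127.0.0.1:%d/' % server.server_address[1]
opener = urllib2.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
usock = opener.open(url)
seen = usock.read().decode('utf-8')
usock.close()
server.shutdown()

print(seen)  # the User-Agent the server saw: Mozilla/5.0
```

Because addheaders replaces the opener's default header list, the server sees exactly the user agent you set.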

Get Original URL

Once I got into trouble while crawling some websites. Some of the URLs I had weren't the original URLs; rather, they redirected to some other URL. So I came up with a function to get the original URL. Here I share it with you:

import cookielib
import urllib2

def get_original_url(url):
    """This function takes a URL and returns the original URL
    along with the cookie jar (if any)."""
    cj = cookielib.CookieJar()
    opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
    opener.addheaders = [('User-agent', 'Mozilla/5.0')]
    usock = opener.open(url)
    url = usock.geturl()
    usock.close()
    return url, cj

Please send me your comments on this piece of code.
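Here is a quick way to see the idea in action without depending on an external site: a throwaway local server (my own addition, purely for illustration) that redirects /old to /new. The function is redefined here so the snippet is self-contained, and the try/except imports are my assumption so it also runs on newer Pythons:

```python
import threading

try:  # Python 2
    import urllib2
    import cookielib
    from BaseHTTPServer import HTTPServer, BaseHTTPRequestHandler
except ImportError:  # Python 3 renamed these modules
    import urllib.request as urllib2
    import http.cookiejar as cookielib
    from http.server import HTTPServer, BaseHTTPRequestHandler

def get_original_url(url):
    """Takes a URL and returns the final URL after redirects,
    along with the cookie jar (if any)."""
    cj = cookielib.CookieJar()
    opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
    opener.addheaders = [('User-agent', 'Mozilla/5.0')]
    usock = opener.open(url)
    url = usock.geturl()
    usock.close()
    return url, cj

class RedirectHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == '/old':
            # Send the client off to /new.
            self.send_response(302)
            self.send_header('Location', '/new')
            self.end_headers()
        else:
            self.send_response(200)
            self.end_headers()
            self.wfile.write(b'hello')

    def log_message(self, fmt, *args):
        pass  # silence per-request logging

server = HTTPServer(('127.0.0.1', 0), RedirectHandler)
port = server.server_address[1]
t = threading.Thread(target=server.serve_forever)
t.daemon = True
t.start()

final_url, cookies = get_original_url('http://127.0.0.1:%d/old' % port)
server.shutdown()

print(final_url)  # ends with /new, not /old
```

The opener follows the 302 automatically, so geturl() reports the URL you actually ended up at.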

Set a timeout while spidering a site

Though I depend heavily on the urllib2 module to develop web crawlers, sometimes the crawlers just get stuck ... :(. So it's necessary to set a timeout, but unfortunately urllib2 doesn't provide anything for this purpose. So we have to depend on the socket module. Here is the code that I use:

import socket

timeout = 300  # seconds
socket.setdefaulttimeout(timeout)
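A quick sanity check (a minimal sketch, assuming nothing beyond the standard library): every socket created after the call inherits the default timeout, which you can confirm without opening any network connection:

```python
import socket

socket.setdefaulttimeout(300)  # seconds

# Sockets created after the call pick up the default timeout,
# so urllib2's connections will give up after 300 seconds too.
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
per_socket = s.gettimeout()
s.close()

print(per_socket)  # 300.0
```

Note that the setting is process-wide: it affects every socket created afterwards, not just the ones your crawler opens.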

Get the HTML source of a URL

I have been using Python to write web crawlers/spiders/scrapers for a long time, and it's an interesting experience indeed. The good news is, I have decided to share my web crawler experience with you. I shall use the terms crawler, spider, and scraper interchangeably. The most basic thing in writing a web spider is getting the HTML source (i.e. content) of a URL. There are many ways to do it. Here I post a simple piece of code that gets the HTML source from a URL:

import urllib2

url = 'http://abc.com'  # write the url here
usock = urllib2.urlopen(url)
data = usock.read()
usock.close()
print data

urllib2 is a very useful module for the spiderman ;) so take a look at the documentation: http://www.python.org/doc/current/lib/module-urllib2.html
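If you want to try the same urlopen call without hitting a live site, it also accepts file:// URLs. The temporary file below is my own addition just to keep the example self-contained, and the try/except import is an assumption so the snippet also runs on newer Pythons:

```python
import os
import tempfile

try:  # Python 2
    import urllib2
except ImportError:  # Python 3: urllib2 became urllib.request
    import urllib.request as urllib2

# Write a small local file to fetch through urlopen's file:// handler.
fd, path = tempfile.mkstemp(suffix='.html')
os.write(fd, b'<html>hello spider</html>')
os.close(fd)

url = 'file://' + path  # http:// URLs work the same way
usock = urllib2.urlopen(url)
data = usock.read()
usock.close()
os.remove(path)

print(data)
```

The rest of the code is identical whether the URL points at a local file or a remote page, which makes file:// handy for testing a spider offline.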