
Use a user agent in your spider

Some websites won't let your spider scrape their pages unless you send a user agent with the request. You can make such websites believe the request is coming from a browser by setting the User-Agent header. Here is a piece of code that uses the user agent 'Mozilla/5.0' to get the HTML content of a website:

import urllib2

url = "http://www.example.com"  # write your url here
opener = urllib2.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
usock = opener.open(url)
url = usock.geturl()
data = usock.read()
usock.close()
print data

You can use other user agents as well. For example, this is the user agent my Firefox browser uses: "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.1.12) Gecko/20061201 Firefox/2.0.0.12 (Ubuntu-feisty)"

What is your user agent?
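To check which user agent actually reaches the server, a small local test server can echo the header back. The test server below is my own addition, purely for illustration; the try/except import is also an assumption of mine so the snippet runs on newer Pythons, where urllib2 became urllib.request:

```python
import threading

try:  # Python 2
    import urllib2
    from BaseHTTPServer import HTTPServer, BaseHTTPRequestHandler
except ImportError:  # Python 3: urllib2 became urllib.request
    import urllib.request as urllib2
    from http.server import HTTPServer, BaseHTTPRequestHandler

class EchoAgentHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Echo back whatever User-Agent header the client sent.
        agent = self.headers.get('User-Agent', '')
        self.send_response(200)
        self.end_headers()
        self.wfile.write(agent.encode('utf-8'))

    def log_message(self, fmt, *args):
        pass  # keep the example's output quiet

# Bind to port 0 so the OS picks a free port.
server = HTTPServer(('127.0.0.1', 0), EchoAgentHandler)
t = threading.Thread(target=server.serve_forever)
t.daemon = True
t.start()

url = 'http://127.0.0.1:%d/' % server.server_address[1]
opener = urllib2.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
usock = opener.open(url)
seen = usock.read().decode('utf-8')
usock.close()
server.shutdown()

print(seen)  # the User-Agent the server saw: Mozilla/5.0
```

Because addheaders replaces the opener's default header list, the server sees exactly the user agent you set.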

Get Original URL

Once I got into trouble while crawling some websites. Some of the URLs I had weren't the original URLs; rather, they redirected to some other URL. So I came up with a function to get the original URL. Here I share it with you:

import cookielib
import urllib2

def get_original_url(url):
    """This function takes a URL and returns the original URL
    along with the cookie jar (if any)."""
    cj = cookielib.CookieJar()
    opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
    opener.addheaders = [('User-agent', 'Mozilla/5.0')]
    usock = opener.open(url)
    url = usock.geturl()
    usock.close()
    return url, cj

Please send me your comments on this piece of code.
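Here is a quick way to see the idea in action without depending on an external site: a throwaway local server (my own addition, purely for illustration) that redirects /old to /new. The function is redefined here so the snippet is self-contained, and the try/except imports are my assumption so it also runs on newer Pythons:

```python
import threading

try:  # Python 2
    import urllib2
    import cookielib
    from BaseHTTPServer import HTTPServer, BaseHTTPRequestHandler
except ImportError:  # Python 3 renamed these modules
    import urllib.request as urllib2
    import http.cookiejar as cookielib
    from http.server import HTTPServer, BaseHTTPRequestHandler

def get_original_url(url):
    """Takes a URL and returns the final URL after redirects,
    along with the cookie jar (if any)."""
    cj = cookielib.CookieJar()
    opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
    opener.addheaders = [('User-agent', 'Mozilla/5.0')]
    usock = opener.open(url)
    url = usock.geturl()
    usock.close()
    return url, cj

class RedirectHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == '/old':
            # Send the client off to /new.
            self.send_response(302)
            self.send_header('Location', '/new')
            self.end_headers()
        else:
            self.send_response(200)
            self.end_headers()
            self.wfile.write(b'hello')

    def log_message(self, fmt, *args):
        pass  # silence per-request logging

server = HTTPServer(('127.0.0.1', 0), RedirectHandler)
port = server.server_address[1]
t = threading.Thread(target=server.serve_forever)
t.daemon = True
t.start()

final_url, cookies = get_original_url('http://127.0.0.1:%d/old' % port)
server.shutdown()

print(final_url)  # ends with /new, not /old
```

The opener follows the 302 automatically, so geturl() reports the URL you actually ended up at.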

Set a timeout while spidering a site

Though I depend heavily on the urllib2 module to develop web crawlers, sometimes the crawlers just get stuck ... :(. So it's necessary to set a timeout, but unfortunately urllib2 doesn't provide anything for this purpose. So we have to depend on the socket module. Here is the code that I use:

import socket

timeout = 300  # seconds
socket.setdefaulttimeout(timeout)
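A quick sanity check (a minimal sketch, assuming nothing beyond the standard library): every socket created after the call inherits the default timeout, which you can confirm without opening any network connection:

```python
import socket

socket.setdefaulttimeout(300)  # seconds

# Sockets created after the call pick up the default timeout,
# so urllib2's connections will give up after 300 seconds too.
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
per_socket = s.gettimeout()
s.close()

print(per_socket)  # 300.0
```

Note that the setting is process-wide: it affects every socket created afterwards, not just the ones your crawler opens.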

Get the HTML source of a URL

I have been using Python to write web crawlers/spiders/scrapers for a long time, and it's an interesting experience indeed. The good news is, I have decided to share my web crawler experience with you. I shall use the terms crawler, spider, and scraper interchangeably. The most basic thing in writing a web spider is getting the HTML source (i.e. content) of a URL. There are many ways to do it. Here I post a simple piece of code that gets the HTML source from a URL:

import urllib2

url = 'http://abc.com'  # write the url here
usock = urllib2.urlopen(url)
data = usock.read()
usock.close()
print data

urllib2 is a very useful module for the spiderman ;) so take a look at the documentation: http://www.python.org/doc/current/lib/module-urllib2.html
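If you want to try the same urlopen call without hitting a live site, it also accepts file:// URLs. The temporary file below is my own addition just to keep the example self-contained, and the try/except import is an assumption so the snippet also runs on newer Pythons:

```python
import os
import tempfile

try:  # Python 2
    import urllib2
except ImportError:  # Python 3: urllib2 became urllib.request
    import urllib.request as urllib2

# Write a small local file to fetch through urlopen's file:// handler.
fd, path = tempfile.mkstemp(suffix='.html')
os.write(fd, b'<html>hello spider</html>')
os.close(fd)

url = 'file://' + path  # http:// URLs work the same way
usock = urllib2.urlopen(url)
data = usock.read()
usock.close()
os.remove(path)

print(data)
```

The rest of the code is identical whether the URL points at a local file or a remote page, which makes file:// handy for testing a spider offline.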