Use a user agent in your spider

Some websites won't let your spider scrape their pages unless you send a user agent with the request. By setting a user agent header, you can make the site believe the request is coming from a regular browser. Here is a piece of code that uses the user agent 'Mozilla/5.0' to fetch the HTML content of a website:


import urllib2

url = "http://www.example.com"  # write your url here
opener = urllib2.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]  # pretend to be a browser
usock = opener.open(url)
url = usock.geturl()   # final URL after any redirects
data = usock.read()    # HTML content of the page
usock.close()
print data


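If you are on Python 3, urllib2 no longer exists; its functionality lives in urllib.request. A roughly equivalent sketch (using the same example URL) would look like this:

import urllib.request

url = "http://www.example.com"  # write your url here
opener = urllib.request.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]  # pretend to be a browser
usock = opener.open(url)
data = usock.read().decode('utf-8', errors='replace')  # response body is bytes, decode to text
usock.close()
print(data)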
You can use other user agents as well. For example, this is the user agent my Firefox browser sends:
"Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.1.12) Gecko/20061201 Firefox/2.0.0.12 (Ubuntu-feisty)"

What is your user agent?

Comments

sajid said…
my user agent (fedora 8) is :
Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.1.8) Gecko/20071030 Fedora/2.0.0.8-2.fc8 Firefox/2.0.0.8
