Use a user agent in your spider

Some websites won't let your spider scrape their pages unless you send a user agent with the request. By setting a user agent header, you can make the site believe the request is coming from a regular browser. Here is a piece of code that uses the user agent 'Mozilla/5.0' to fetch the HTML content of a website:


import urllib2

url = "http://www.example.com"  # write your url here
opener = urllib2.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]  # pretend to be a browser
usock = opener.open(url)
url = usock.geturl()   # final URL after any redirects
data = usock.read()    # HTML content of the page
usock.close()
print data


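If you are on Python 3, urllib2 no longer exists; its functionality lives in urllib.request. A roughly equivalent sketch (using the same example URL) would look like this:

import urllib.request

url = "http://www.example.com"  # write your url here
opener = urllib.request.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]  # pretend to be a browser
usock = opener.open(url)
data = usock.read().decode('utf-8', errors='replace')  # response body is bytes, decode to text
usock.close()
print(data)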
You can use other user agents as well. For example, this is the user agent my Firefox browser sends:
"Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.1.12) Gecko/20061201 Firefox/2.0.0.12 (Ubuntu-feisty)"

What is your user agent?

Comments

sajid said…
my user agent (fedora 8) is :
Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.1.8) Gecko/20071030 Fedora/2.0.0.8-2.fc8 Firefox/2.0.0.8
