Use Proxy in your Spider

Using a proxy reduces the chance of your crawlers/spiders getting blocked. Here is how to use a proxy IP address in your spider in Python. First, load the proxy list from a file:

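# Read one proxy per line, dropping trailing newlines and blank lines
with open('proxylist.txt', 'r') as fileproxylist:
    proxyList = [line.strip() for line in fileproxylist if line.strip()]
indexproxy = 0
totalproxy = len(proxyList)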

Now, for each proxy in the list, call the following function:

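import urllib2  # Python 2; in Python 3 this module is urllib.request

def get_source_html_proxy(url, pip):
    # pip is the proxy address as a string, e.g. 'host:port'
    proxy_handler = urllib2.ProxyHandler({'http': pip})
    opener = urllib2.build_opener(proxy_handler)
    opener.addheaders = [('User-agent', 'Mozilla/5.0')]
    # install_opener makes this opener the default for later urlopen calls
    urllib2.install_opener(opener)
    req = urllib2.Request(url)
    sock = urllib2.urlopen(req)
    data = sock.read()
    return data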
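
The loop over the list is not shown above. Here is a minimal sketch of one way to do it, where fetch_with_proxies is a hypothetical helper name and not part of the original code. It takes the fetch function as an argument, so you can pass get_source_html_proxy or a stand-in for testing without a network:

```python
def fetch_with_proxies(url, proxy_list, fetch):
    """Try each proxy in order; return (proxy, data) for the first success.

    `fetch` is any callable taking (url, proxy), for example
    get_source_html_proxy. Hypothetical helper, shown only as a sketch.
    """
    for pip in proxy_list:
        try:
            return pip, fetch(url, pip)
        except Exception:
            # This proxy failed (dead, blocked, refused); try the next one.
            continue
    # Every proxy in the list failed, or the list was empty.
    return None, None
```

You would call it as fetch_with_proxies(url, proxyList, get_source_html_proxy). Catching every exception is a blunt choice made for brevity; in a real spider you may prefer to catch only network errors.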

Hope your spidering experience will be better with proxies :-)

You can use this list to test your crawler.

Comments

Hi
Very interesting article!
How do I know if a proxy is not responding? I mean, is there a simple way to set a timeout?
Thanks!
Nicola
Tamim Shahriar said…
Yes, you can set timeout. Check this: http://love-python.blogspot.com/2008/02/set-timeout-while-spidering-site.html
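
As a rough sketch of the idea (not necessarily what the linked post does), and assuming Python 3, where urllib2 became urllib.request, you can pass a timeout in seconds when opening the URL. The function name get_source_html_proxy_timeout and the default of 10 seconds are assumptions for illustration:

```python
import socket
import urllib.error
import urllib.request


def get_source_html_proxy_timeout(url, pip, timeout=10):
    """Fetch url through proxy pip; return None if it fails or times out."""
    proxy_handler = urllib.request.ProxyHandler({'http': pip})
    opener = urllib.request.build_opener(proxy_handler)
    opener.addheaders = [('User-agent', 'Mozilla/5.0')]
    try:
        # timeout is in seconds and applies to blocking socket
        # operations such as connecting and reading.
        sock = opener.open(url, timeout=timeout)
        return sock.read()
    except (urllib.error.URLError, socket.timeout):
        # URLError covers refused connections and HTTP error responses;
        # socket.timeout covers a proxy that connects but never answers.
        return None
```

A return value of None then tells you the proxy did not respond in time (or failed in some other way), so you can move on to the next one.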

Thanks.
THANKS!!!
I'm definitely trying that.
thanks again
Nicola
Unknown said…
This script picks a random proxy from a list, then changes your IP so you can spider a site, right?
Tamim Shahriar said…
@Heather, you are almost correct. You can take the proxies one by one from the list, or you can pick a random one. And by using a proxy you can hide your IP. Useful for spiders. :)
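
For the random case, a small sketch using the standard random module could look like this. The name pick_random_proxy is made up for illustration; the value it returns is what the function in the post receives as pip, the proxy address string:

```python
import random


def pick_random_proxy(proxy_list):
    """Return one proxy chosen uniformly at random from a non-empty list."""
    # random.choice raises IndexError if the list is empty.
    return random.choice(proxy_list)
```

Each request can then use a fresh pick, for example get_source_html_proxy(url, pick_random_proxy(proxyList)).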
Unknown said…
So how exactly do I get it to choose a random proxy every time?

I don't get the def url pip.
Kamil said…
Thank you! Now I can scrape sites from Google without an IP ban :D
Alock Roy said…
Hi,
I want to scrape a page using Python, but to go to the next page I have to submit a form. The form is submitted with 10 hidden values. How do I submit the form programmatically? The link is "https://www.jobs.lbhf.gov.uk/paplve_webrecruitment/wrd/run/ETREC106GF.display_srch_all?WVID=52561500BT&LANG=USA" . Please give me some suggestions or guidelines.
Thanks
Sushanta
