Use Proxy in your Spider

Using a proxy, you can reduce the chance that your crawlers/spiders get blocked. Here is how to use a proxy IP address in your spider in Python. First, load the proxy list from a file:

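# Read one proxy per line, dropping trailing newlines and blank lines
with open('proxylist.txt', 'r') as fileproxylist:
    proxyList = [line.strip() for line in fileproxylist if line.strip()]
indexproxy = 0
totalproxy = len(proxyList)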

Now, for each proxy in the list, call the following function:

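import urllib2

def get_source_html_proxy(url, pip):
    # pip is the proxy address, e.g. one entry from proxyList
    proxy_handler = urllib2.ProxyHandler({'http': pip})
    opener = urllib2.build_opener(proxy_handler)
    # Send a browser-like User-agent instead of the default Python one
    opener.addheaders = [('User-agent', 'Mozilla/5.0')]
    # Make this opener the default for later urllib2.urlopen() calls
    urllib2.install_opener(opener)
    req = urllib2.Request(url)
    sock = urllib2.urlopen(req)
    data = sock.read()
    sock.close()
    return data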
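
The function above targets Python 2. Python 3 has no urllib2; the same classes live in urllib.request. Below is a minimal sketch of the same idea in Python 3 that also tries each proxy in turn until one works. The names build_proxy_opener, fetch_via_proxy and fetch_with_fallback, the 10-second timeout, and routing https through the same proxy are assumptions made for illustration, not part of the original code.

```python
import urllib.request


def build_proxy_opener(pip):
    """Return an opener that sends requests through the proxy `pip`."""
    # Assumption: the same proxy is used for both http and https URLs.
    handler = urllib.request.ProxyHandler({'http': pip, 'https': pip})
    opener = urllib.request.build_opener(handler)
    opener.addheaders = [('User-agent', 'Mozilla/5.0')]
    return opener


def fetch_via_proxy(url, pip, timeout=10):
    """Fetch `url` through `pip`; raises an OSError subclass on failure."""
    opener = build_proxy_opener(pip)
    # Using the opener directly avoids changing the global default opener.
    with opener.open(url, timeout=timeout) as sock:
        return sock.read()


def fetch_with_fallback(url, proxies, fetch=fetch_via_proxy):
    """Try each proxy in order; return the first response, or None."""
    for pip in proxies:
        try:
            return fetch(url, pip)
        except OSError as err:
            # URLError and socket timeouts are both subclasses of OSError.
            print('proxy %s failed: %s' % (pip, err))
    return None
```

With the list loaded earlier, a call such as fetch_with_fallback('http://example.com/', proxyList) would return the page from the first proxy that answers, or None if none of them does.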

Hope your spidering experience will be better with proxies :-)

You can use this list to test your crawler.

Comments

Hi
Very interesting article!
How do I know if a proxy is not responding? I mean, is there a simple way to set a timeout?
Thanks!
Nicola
Tamim Shahriar said…
Yes, you can set timeout. Check this: http://love-python.blogspot.com/2008/02/set-timeout-while-spidering-site.html

Thanks.
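
One possible way to set a timeout (not necessarily the approach in the linked post) is socket.setdefaulttimeout, which applies to every socket created afterwards, including the ones urllib2 or urllib.request open internally. The sketch below wraps it in a helper so the previous setting is restored afterwards; the name default_timeout is a hypothetical helper, not a library function.

```python
import socket
from contextlib import contextmanager


@contextmanager
def default_timeout(seconds):
    """Temporarily set the default timeout for newly created sockets."""
    previous = socket.getdefaulttimeout()
    socket.setdefaulttimeout(seconds)
    try:
        yield
    finally:
        # Restore whatever was configured before, even if an error occurred.
        socket.setdefaulttimeout(previous)
```

Any fetch made inside "with default_timeout(10):" should then give up with a timeout error after roughly 10 seconds of silence from the proxy, so a dead proxy no longer hangs the spider.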
THANKS!!!
I'm definitely trying that.
thanks again
Nicola
Unknown said…
This script picks a random proxy from a list, then changes your IP so you can spider it up right?
Tamim Shahriar said…
@Heather, you are almost correct. You can take the proxies one by one from the list, or pick a random one each time. And by using a proxy, you can hide your IP. Useful for spiders. :)
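
Both ways of choosing a proxy can be sketched in a few lines. The addresses below are placeholders from a documentation-only range, and the names random_proxy and proxy_cycle are hypothetical helpers introduced for illustration.

```python
import itertools
import random

# Placeholder addresses; in practice this would be the list read from proxylist.txt.
proxyList = ['192.0.2.1:8080', '192.0.2.2:8080', '192.0.2.3:8080']


def random_proxy(proxies):
    """Pick one proxy at random for the next request."""
    return random.choice(proxies)


def proxy_cycle(proxies):
    """Return an endless iterator that walks the list one by one, then wraps."""
    return itertools.cycle(proxies)
```

Calling random_proxy(proxyList) before each request gives a random proxy every time, while next(rotation) on rotation = proxy_cycle(proxyList) gives the proxies one by one. Either value can be passed as the pip argument of the fetching function.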
Unknown said…
So how exactly do I get it to choose a random proxy every time?

I don't get the def url pip.
Kamil said…
Thank you! Now I can scrape sites from Google without getting my IP banned :D
Alock Roy said…
Hi,
I want to scrape a page using Python. The problem is that to go to the next page I have to submit a form, and the form is submitted with 10 hidden values. How do I submit the form programmatically? The link is "https://www.jobs.lbhf.gov.uk/paplve_webrecruitment/wrd/run/ETREC106GF.display_srch_all?WVID=52561500BT&LANG=USA". Please give me some suggestions or guidelines.
Thanks
Sushanta
