
Showing posts with the label python web crawler

check status of proxy address

Often we need to use proxy addresses in our web spiders / crawlers, but most of the time the proxies don't work. So I made a little Python program to test the proxy IPs. Let's look into the code:

import urllib2, socket

socket.setdefaulttimeout(180)

# read the list of proxy IPs in proxyList
proxyList = ['125.76.226.9:80', '213.55.87.162:6588'] # there are two sample proxy IPs

for item in proxyList:
    if is_bad_proxy(item):
        print "Bad Proxy", item
    else:
        print item, "is working"

def is_bad_proxy(pip):
    try:
        proxy_handler = urllib2.ProxyHandler({'http': pip})
        opener = urllib2.build_opener(proxy_handler)
        opener.addheaders = [('User-agent', 'Mozilla/5.0')]
        urllib2.install_opener(opener)
        req = urllib2.Request('http://www.your-domain.com') # change the url address here
        sock = urllib2.urlopen(req)
    except urllib2.HTTPError,...
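The excerpt above is cut off, and note that is_bad_proxy has to be defined before the loop that calls it runs. For reference, here is a minimal sketch of the same idea in Python 3, where urllib2 became urllib.request; the test URL and timeout are placeholders you would adapt.

import urllib.request

def is_bad_proxy(proxy_ip, test_url='http://www.example.com', timeout=30):
    # return True if the proxy fails to fetch the test URL within the timeout
    proxy_handler = urllib.request.ProxyHandler({'http': proxy_ip})
    opener = urllib.request.build_opener(proxy_handler)
    opener.addheaders = [('User-agent', 'Mozilla/5.0')]
    try:
        opener.open(test_url, timeout=timeout)
    except OSError:   # URLError, HTTPError and socket errors are subclasses of OSError in Python 3
        return True
    return False

proxyList = ['125.76.226.9:80', '213.55.87.162:6588']
for item in proxyList:
    if is_bad_proxy(item):
        print("Bad Proxy", item)
    else:
        print(item, "is working")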

Accept-Encoding 'gzip' to make your crawler faster

If you are looking for a way to make your web crawler / spider work faster, then you must be looking to reduce the time it takes to fetch data from the Internet. Fortunately, many websites provide their content in compressed format if the server gets such a request. If you look at the request headers Firefox sends (you can use Firebug to see the request / response headers), you can find that it always asks for gzip content. If the server supports gzip, it sends the data in gzip form, otherwise it sends plain html. gzip data reduces bandwidth and thus makes the program that scrapes / harvests data faster. So you can do the same from your Python code. Just add ('Accept-Encoding', 'gzip,deflate') to the request headers. Check the following code chunk:

opener = urllib2.build_opener()
opener.addheaders = [('Referer', referer),
                     ('User-Agent', uagent),
                     ('Accept-Encoding', 'gzip,deflate')]
usock = opener.open(url)
url = usock.geturl()
data = decode(usock)
usock.c...
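The decode helper is not shown in the excerpt above. As a rough illustration (not the original code), a Python 3 sketch that requests gzip and decompresses the response only when the server actually used it could look like this; decode_response is a hypothetical helper name and the URL is a placeholder.

import gzip
import urllib.request

def decode_response(resp):
    # decompress the body only if the server says it is gzip-encoded
    data = resp.read()
    if resp.headers.get('Content-Encoding') == 'gzip':
        data = gzip.decompress(data)
    return data

req = urllib.request.Request('http://www.example.com',
                             headers={'User-Agent': 'Mozilla/5.0',
                                      'Accept-Encoding': 'gzip,deflate'})
with urllib.request.urlopen(req) as resp:
    html = decode_response(resp)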

Regular Expression not working in scraper?

This is a very common problem for beginners who try to write a web crawler / spider / scraper: the content is fetched but the regex doesn't work as expected. :( But the problem is not with the regular expression. You just need to add the following two lines after you fetch the content of a web page:

content = content.replace("\n", "")
content = content.replace("\r", "")

Now the regex should work if everything else is ok!
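As a side note (not from the original post), the same effect can be achieved without modifying the content by passing the re.DOTALL flag, which lets '.' match newlines as well; a small sketch:

import re

content = "<title>\nMy page\r\n</title>"

# Option 1: strip the line breaks first, as described above
cleaned = content.replace("\n", "").replace("\r", "")
print(re.findall(r"<title>(.*)</title>", cleaned))               # ['My page']

# Option 2: tell the regex engine that '.' should also match newlines
print(re.findall(r"<title>(.*)</title>", content, re.DOTALL))    # ['\nMy page\r\n']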

get content (html source) of a URL by HTTP POST method in Python

To retrieve or get the content (html source) of a URL, sometimes we need to POST some values. Here I show you a sample Python script that uses POST.

import urllib, urllib2, time

url = 'http://www.example.com'   # write your URL here
values = {'key1' : 'value1',     # write your specific key/value pairs
          'key2' : 'value2',
          'key3' : 'value3',
         }

try:
    data = urllib.urlencode(values)
    req = urllib2.Request(url, data)
    response = urllib2.urlopen(req)
    the_page = response.read()
    print the_page
except Exception, detail:
    print "Err ", detail

Hope it will be useful for you while writing a web crawler / spider, especially where values must be submitted using the HTTP POST method to get or extract content from a URL. Please write your comments.
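For newer projects, the same request in Python 3 would go through urllib.parse and urllib.request; a minimal sketch, with placeholder URL and form fields:

import urllib.parse
import urllib.request

url = 'http://www.example.com'                      # placeholder URL
values = {'key1': 'value1', 'key2': 'value2'}       # placeholder form fields

data = urllib.parse.urlencode(values).encode('utf-8')    # the POST body must be bytes
req = urllib.request.Request(url, data=data)             # passing data makes it a POST request
try:
    with urllib.request.urlopen(req) as response:
        the_page = response.read().decode('utf-8', errors='replace')
    print(the_page)
except Exception as detail:
    print("Err", detail)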

Python code to scrape email address using regular expression

Programmers, when they learn to write web spiders or crawlers, often try to write a script to parse / collect email addresses from websites. Here I post a class in Python that can harvest email addresses. It uses regular expressions.

import re

class EmailScraper():
    def __init__(self):
        self.emails = []

    def reset(self):
        self.emails = []

    def collectAllEmail(self, htmlSource):
        "collects all possible email addresses from a string, but still it can miss some addresses"
        # example: t.s@d.com
        email_pattern = re.compile("[-a-zA-Z0-9._]+@[-a-zA-Z0-9_]+.[a-zA-Z0-9_.]+")
        self.emails = re.findall(email_pattern, htmlSource)

    def collectEmail(self, htmlSource):
        "collects all emails that starts with mailto: in the html source string"
        # example: <a href="mailto:t.s@d.com">
        email_pattern = re.compile("<a\s+href=\"mailto:([a-zA-Z0-9._@]*)\">", re.IGNORECASE...
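The excerpt is cut off above. As a rough, standalone illustration of the same regex idea (not the original class), a Python 3 sketch might look like this; the pattern is deliberately loose and can still miss or over-match addresses.

import re

# loose pattern in the same spirit as the post, with the dot before the TLD escaped
EMAIL_PATTERN = re.compile(r"[-a-zA-Z0-9._]+@[-a-zA-Z0-9_]+\.[a-zA-Z0-9_.]+")

def collect_all_emails(html_source):
    # return every substring that looks like an email address
    return EMAIL_PATTERN.findall(html_source)

html = '<p>Contact: <a href="mailto:t.s@d.com">t.s@d.com</a></p>'
print(collect_all_emails(html))   # ['t.s@d.com', 't.s@d.com']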

extract domain name from url

Sometimes I need to find the domain name from a URL in my programs for various purposes (most of the time in my crawlers). So far I have used the following function, which takes a URL and returns the domain name:

def find_domain(url):
    pos = url[7:].find('/')
    if pos == -1:
        pos = url[7:].find('?')
        if pos == -1:
            return url[7:]
    url = url[7:(7+pos)]
    return url

But today I found a module named urlparse. So my function now looks like this:

from urlparse import urlparse

def find_domain2(url):
    return urlparse(url)[1]

The new one is much better I think. Check urlparse for details.
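In Python 3 the urlparse module became urllib.parse, and the netloc attribute reads a bit more clearly than indexing the result; a minimal sketch:

from urllib.parse import urlparse

def find_domain3(url):
    # netloc is the network location part: the domain, possibly with a port
    return urlparse(url).netloc

print(find_domain3('http://www.example.com/some/page?x=1'))   # www.example.com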

How to download a file using Python?

A couple of weeks ago, I had to write a spider that harvests data from a website into a csv file and downloads the images. First I was thinking how to do the download... then I came up with a simple idea and wrote a function save_image that takes the url of the jpg image and a filename, downloads the file, and saves it with the name given in filename.

import urllib2

def save_image(url, filename):
    usock = urllib2.urlopen(url)
    data = usock.read()
    usock.close()
    fp = open(filename, 'wb')
    fp.write(data)
    fp.close()

Actually I just write the file in binary mode. Now post your code that performs this task in a different manner.
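One alternative (not the method from the post) is to stream the response to disk in chunks rather than reading the whole file into memory first; a Python 3 sketch, with a placeholder URL and filename:

import shutil
import urllib.request

def save_image(url, filename):
    # copy the response body to the file in chunks, without loading it all into memory
    with urllib.request.urlopen(url) as response, open(filename, 'wb') as fp:
        shutil.copyfileobj(response, fp)

save_image('http://www.example.com/picture.jpg', 'picture.jpg')   # placeholder URL and filename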