
simple web crawler / scraper tutorial using requests module in python

Let me show you how to use the Requests Python module to write a simple web crawler / scraper. First, let's define our problem: on this page, http://cpbook.subeen.com/p/blog-page_11.html, I am publishing some programming problems. Now I shall write a script to get the links (URLs) of the problems. Let's start. First, make sure you can get the content of the page. For this, write the following code:

    import requests

    def get_page(url):
        r = requests.get(url)
        print r.status_code
        with open("test.html", "w") as fp:
            fp.write(r.text)

    if __name__ == "__main__":
        url = 'http://cpbook.subeen.com/p/blog-page_11.html'
        get_page(url)

Now run the program:

    $ python cpbook_crawler.py
    200
    Traceback (most recent call last)...
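The next step after fetching the page is pulling out the problem links. Since the post's code is Python 2, here is a small Python 3 sketch of that link-extraction step, run on a made-up HTML snippet rather than the live page; the extract_links helper and the sample URLs are my own illustration, not from the original post.

```python
import re

def extract_links(html):
    """Return all href values found in anchor tags of an HTML string."""
    return re.findall(r'<a\s+[^>]*href="([^"]+)"', html)

# A made-up snippet standing in for the fetched blog page.
sample_html = (
    '<p><a href="http://cpbook.subeen.com/2011/01/problem-1.html">Problem 1</a></p>'
    '<p><a href="http://cpbook.subeen.com/2011/01/problem-2.html">Problem 2</a></p>'
)

links = extract_links(sample_html)
print(links)
```

For real pages with messy markup, an HTML parser is more robust than a regex, but for a quick crawl of a page you control this is often enough.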

check the status of a proxy address

Often we need to use proxy addresses in our web spiders / crawlers, but most of the time the proxies don't work. So I made a little Python program to test the proxy IPs. Let's look into the code:

    import urllib2, socket

    socket.setdefaulttimeout(180)

    # read the list of proxy IPs in proxyList
    proxyList = ['125.76.226.9:80', '213.55.87.162:6588']  # two sample proxy IPs

    for item in proxyList:
        if is_bad_proxy(item):
            print "Bad Proxy", item
        else:
            print item, "is working"

    def is_bad_proxy(pip):
        try:
            proxy_handler = urllib2.ProxyHandler({'http': pip})
            opener = urllib2.build_opener(proxy_handler)
            opener.addheaders = [('User-agent', 'Mozilla/5.0')]
            urllib2.install_opener(opener)
            req = urllib2.Request('http://www.your-domain.com')  # change the url address here
            sock = urllib2.urlopen(req)
        except urllib2.HTTPError,...

(Note that in the full script, is_bad_proxy must be defined before the loop that calls it runs.)
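For comparison, a rough Python 3 sketch of the same check using the standard library's urllib.request; the test URL is a placeholder you would change, and in the demo call I deliberately point at an address that cannot accept connections so the function reports it as bad without needing a working proxy.

```python
import urllib.error
import urllib.request

def is_bad_proxy(pip, timeout=5):
    """Try to fetch a page through the proxy; any failure marks it bad."""
    proxy_handler = urllib.request.ProxyHandler({'http': 'http://' + pip})
    opener = urllib.request.build_opener(proxy_handler)
    opener.addheaders = [('User-agent', 'Mozilla/5.0')]
    try:
        opener.open('http://www.example.com', timeout=timeout)  # change the url address here
        return False
    except (urllib.error.URLError, OSError):
        return True

# A proxy that cannot possibly be listening, just to demonstrate the failure path.
print(is_bad_proxy('0.0.0.0:1', timeout=2))
```

Keep the timeout short when scanning a long proxy list; a dead proxy that hangs for the full default timeout wastes far more time than one that refuses the connection outright.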

Accept-Encoding 'gzip' to make your crawler faster

If you are looking for a way to make your web crawler / spider work faster, then you must be looking to reduce the time it takes to fetch data from the Internet. Fortunately, many websites provide the content in compressed form if the server gets such a request. If you look at Firefox's request headers (you can use Firebug to see the request / response headers), you can find that it always asks for gzip content. If the server supports gzip, it sends the data in gzip form; otherwise it sends plain html. gzip data reduces bandwidth and thus makes the program that scrapes / harvests data faster. So you can do the same from your Python code. Just add ('Accept-Encoding', 'gzip,deflate') to the request header. Check the following code chunk:

    opener = urllib2.build_opener()
    opener.addheaders = [('Referer', referer),
                         ('User-Agent', uagent),
                         ('Accept-Encoding', 'gzip,deflate')]
    usock = opener.open(url)
    url = usock.geturl()
    data = decode(usock)
    usock.c...
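The decode call in the truncated chunk presumably gunzips the body when the server honored the gzip request. A minimal Python 3 sketch of that idea, checking the Content-Encoding value and decompressing only when needed; the decode_body helper and the simulated response bytes are my own illustration:

```python
import gzip

def decode_body(raw, content_encoding):
    """Decompress the response body if the server sent it gzip-encoded."""
    if content_encoding == 'gzip':
        return gzip.decompress(raw)
    return raw

# Simulate a gzip-encoded response body instead of hitting the network.
page = b'<html><body>hello</body></html>'
compressed = gzip.compress(page)
print(decode_body(compressed, 'gzip') == page)
print(decode_body(page, None) == page)
```

In a real crawler you would read content_encoding from the response headers; note also that modern HTTP libraries such as Requests transparently decompress gzip for you.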

Regular Expression not working in scraper?

This is a very common problem for beginners who try to write a web crawler / spider / scraper. The content is fetched but the regex is not working right. :( But the problem is not with the regular expression. By default, a dot (.) in a regex does not match newline characters, so patterns fail when the content they should match spans several lines. You just need to add the following two lines after you fetch the content of a web page:

    content = content.replace("\n", "")
    content = content.replace("\r", "")

Now the regex should work if everything else is ok!
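The stripping trick above works because of the dot-vs-newline rule; passing re.DOTALL achieves the same thing without modifying the fetched content. A quick Python 3 illustration (the sample HTML is made up):

```python
import re

content = "<td>\nHello\n</td>"

# Without DOTALL, '.' refuses to cross the newlines, so there is no match.
print(re.search(r"<td>(.*)</td>", content))

# Option 1: remove the line breaks first, as suggested above.
cleaned = content.replace("\n", "").replace("\r", "")
print(re.search(r"<td>(.*)</td>", cleaned).group(1))

# Option 2: tell the regex engine to let '.' match newlines too.
print(re.search(r"<td>(.*)</td>", content, re.DOTALL).group(1))
```

The DOTALL variant preserves the original line breaks inside the match, which matters if you need the captured text verbatim.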

get content (html source) of a URL by HTTP POST method in Python

To retrieve or get the content (html source) of a URL, sometimes we need to POST some values. Here I show you a sample Python code that uses POST.

    import urllib, urllib2

    url = 'http://www.example.com'     # write your URL here
    values = {'key1': 'value1',        # write your specific key/value pairs
              'key2': 'value2',
              'key3': 'value3',
              }

    try:
        data = urllib.urlencode(values)
        req = urllib2.Request(url, data)
        response = urllib2.urlopen(req)
        the_page = response.read()
        print the_page
    except Exception, detail:
        print "Err ", detail

Hope it will be useful for you while writing a web crawler/spider, especially where values must be submitted using the HTTP POST method to get or extract content from a URL. Please write your comments.
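In Python 3, the same request is built with urllib.parse and urllib.request; supplying a data argument automatically switches the request method to POST. A sketch with placeholder URL and keys, as in the post (the urlopen call is left commented out so nothing is actually sent):

```python
from urllib.parse import urlencode
from urllib.request import Request

url = 'http://www.example.com'  # write your URL here
values = {'key1': 'value1',     # write your specific key/value pairs
          'key2': 'value2'}

data = urlencode(values).encode('ascii')  # POST bodies must be bytes in Python 3
req = Request(url, data)

print(req.get_method())  # reports 'POST' because a body was supplied
# response = urllib.request.urlopen(req)  # this line would actually send it
```

The .encode('ascii') step is the usual Python 3 stumbling block here: urlencode returns a str, but Request wants the body as bytes.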

Python code to scrape email address using regular expression

When programmers learn to write web spiders or crawlers, they often try to write a script to parse/collect email addresses from a website. Here I post a class in Python that can harvest email addresses. It uses regular expressions.

    import re

    class EmailScraper():
        def __init__(self):
            self.emails = []

        def reset(self):
            self.emails = []

        def collectAllEmail(self, htmlSource):
            "collects all possible email addresses from a string, but still it can miss some addresses"
            # example: t.s@d.com
            email_pattern = re.compile("[-a-zA-Z0-9._]+@[-a-zA-Z0-9_]+\\.[a-zA-Z0-9_.]+")  # dot before the TLD escaped to match a literal '.'
            self.emails = re.findall(email_pattern, htmlSource)

        def collectEmail(self, htmlSource):
            "collects all emails that start with mailto: in the html source string"
            # example: <a href="mailto:t.s@d.com">
            email_pattern = re.compile("<a\\s+href=\"mailto:([a-zA-Z0-9._@]*)\">", re.IGNORECASE...
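The same idea works unchanged in Python 3; here is a compact sketch using a raw string for the pattern. The character classes are the post's own and will still miss some legal addresses, as its docstring admits; the sample text is made up for demonstration.

```python
import re

# Same pattern idea as collectAllEmail, with the dot before the TLD escaped.
EMAIL_PATTERN = re.compile(r"[-a-zA-Z0-9._]+@[-a-zA-Z0-9_]+\.[a-zA-Z0-9_.]+")

def collect_all_email(html_source):
    """Return every substring of html_source that looks like an email address."""
    return EMAIL_PATTERN.findall(html_source)

sample = 'Contact t.s@d.com or admin@example.org for help.'
print(collect_all_email(sample))
```

Without the escaped dot, `.` would match any character, so strings like "t.s@d_com" would slip through as false positives.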

How to download a file using Python?

A couple of weeks ago, I had to write a spider that harvests data from a website into a csv file and downloads the images. First I was thinking about how to do the download... then I came up with a simple idea and wrote a function save_image that takes the url of the jpg image and a filename, downloads the file and saves it with the name given in filename.

    import urllib2

    def save_image(url, filename):
        usock = urllib2.urlopen(url)
        data = usock.read()
        usock.close()
        fp = open(filename, 'wb')
        fp.write(data)
        fp.close()

Actually I just write the file in binary mode. Now post your code that performs this task in a different manner.
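A Python 3 take on the same function: shutil.copyfileobj streams the response to disk in chunks, so a large image is never held wholly in memory. To keep the example runnable without network access, it is demonstrated on a local file:// URL; for a real image you would pass the http URL of the jpg instead.

```python
import pathlib
import shutil
import tempfile
import urllib.request

def save_image(url, filename):
    """Download url and stream the bytes to filename."""
    with urllib.request.urlopen(url) as usock, open(filename, 'wb') as fp:
        shutil.copyfileobj(usock, fp)

# Demonstrate on a local file so the example runs offline.
src = pathlib.Path(tempfile.mkdtemp()) / 'source.jpg'
src.write_bytes(b'\xff\xd8\xff fake jpeg bytes')
dest = str(src.parent / 'copy.jpg')
save_image(src.as_uri(), dest)
print(open(dest, 'rb').read() == src.read_bytes())
```

The with-statements also guarantee both the socket and the output file are closed even if the download fails partway, which the original version does not.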