
Showing posts from July, 2008

Strip HTML tags using Python

We often need to strip HTML tags from a string (or from HTML source). I usually do it with a simple regular expression in Python. Here is my function to strip HTML tags:

def remove_html_tags(data):
    p = re.compile(r'<.*?>')
    return p.sub('', data)

Here is another function that collapses more than one consecutive white space into a single space:

def remove_extra_spaces(data):
    p = re.compile(r'\s+')
    return p.sub(' ', data)

Note that the re module needs to be imported in order to use regular expressions. Here you can find updated code that gets the text from html: http://love-python.blogspot.com/2011/04/html-to-text-in-python.html
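As a quick sanity check, the two helpers above can be chained like this (a minimal sketch; the sample HTML string is just an illustration):

import re

# remove_html_tags() and remove_extra_spaces() are assumed to be defined as above
html = "<p>Hello,   <b>world</b>!</p>"
text = remove_html_tags(html)       # -> "Hello,   world!"
text = remove_extra_spaces(text)    # -> "Hello, world!"
print text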

HTTP Response Header Information in Python

Sometimes we need to get the HTTP response header from our program. You can check the following Python code, which prints the HTTP response header:

>>> import urllib2
>>> usock = urllib2.urlopen('http://bdosn.org')
>>> print usock.info()
Date: Thu, 24 Jul 2008 19:52:53 GMT
Server: Apache/1.3.39 (Unix) PHP/5.2.3 mod_auth_passthrough/1.8 mod_log_bytes/1.2 mod_bwlimited/1.4 mod_gzip/1.3.26.1a FrontPage/5.0.2.2634a mod_ssl/2.8.30 OpenSSL/0.9.7a
X-Powered-By: PHP/5.2.3
Connection: close
Transfer-Encoding: chunked
Content-Type: text/html
>>>

If you want to get the value of a particular item, use the get() method:

>>> print usock.info().get('Date')
Thu, 24 Jul 2008 19:52:53 GMT
>>>

Hope you will find these tips useful!
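If you want to walk over all of the headers rather than print the whole block, the object returned by info() can be treated like a dictionary; here is a minimal sketch, assuming the same bdosn.org URL is reachable:

import urllib2

usock = urllib2.urlopen('http://bdosn.org')
headers = usock.info()
# iterate over every (name, value) pair in the response header
for name, value in headers.items():
    print "%s: %s" % (name, value)
usock.close()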

Check the status of a proxy address

Often we need to use proxy addresses in our web spiders / crawlers. But most of the time the proxies don't work. So I made a little Python program to test the proxy IPs. Let's look into the code:

import urllib2, socket

socket.setdefaulttimeout(180)

def is_bad_proxy(pip):
    try:
        proxy_handler = urllib2.ProxyHandler({'http': pip})
        opener = urllib2.build_opener(proxy_handler)
        opener.addheaders = [('User-agent', 'Mozilla/5.0')]
        urllib2.install_opener(opener)
        req = urllib2.Request('http://www.your-domain.com')  # change the url address here
        sock = urllib2.urlopen(req)
    except urllib2.HTTPError, e:
        return True
    except Exception, detail:
        return True
    return False

# read the list of proxy IPs into proxyList; there are two sample proxy IPs here
proxyList = ['125.76.226.9:80', '213.55.87.162:6588']

for item in proxyList:
    if is_bad_proxy(item):
        print "Bad Proxy", item
    else:
        print item, "is working"

Accept-Encoding 'gzip' to make your crawler faster

If you are looking for a way to make your web crawler / spider work faster, then you must be looking to reduce the time it takes to fetch data from the Internet. Fortunately, many websites serve their content in compressed form if the server gets such a request. If you look at the request header Firefox sends (you can use Firebug to see the request / response headers), you will find that it always asks for gzip content. If the server supports gzip, it sends the data in gzip form; otherwise it sends plain html. gzip data reduces bandwidth and thus makes the program that scrapes / harvests data faster. So you can do the same from your Python code: just add ('Accept-Encoding', 'gzip,deflate') to the request header. Check the following code chunk:

opener = urllib2.build_opener()
opener.addheaders = [('Referer', referer),
                     ('User-Agent', uagent),
                     ('Accept-Encoding', 'gzip,deflate')]
usock = opener.open(url)
url = usock.geturl()
data = decode(usock)  # decode() is a helper that decompresses the response if it is gzipped
usock.close()
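The decode() helper is not shown in the snippet above; here is one possible sketch of it, assuming the standard gzip and StringIO modules, that checks the Content-Encoding response header and gunzips the body only when the server actually sent gzip:

import gzip
import StringIO

def decode(usock):
    # read the raw body and decompress it only if the server sent gzip
    data = usock.read()
    if usock.info().get('Content-Encoding') == 'gzip':
        buf = StringIO.StringIO(data)
        data = gzip.GzipFile(fileobj=buf).read()
    return data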

Using Python to interact with MySQL

I have to use a MySQL database frequently. Interacting with MySQL from Python is easy and simple. There is a nice module named MySQLdb available for this purpose. I am not writing details about using it, as there is a good tutorial, " Writing MySQL Scripts with Python DB-API ", available online. I read it myself when I used MySQL with Python for the first time. MySQL Cookbook also covers database programming in Python as well as in other languages.
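To give a flavour of the module, here is a minimal sketch of connecting and running a query with MySQLdb; the host, credentials, database and table names are placeholders you would replace with your own:

import MySQLdb

# connect to the database (placeholder credentials)
conn = MySQLdb.connect(host='localhost', user='dbuser', passwd='secret', db='testdb')
cursor = conn.cursor()

# run a simple query and print every row
cursor.execute("SELECT id, name FROM users")
for row in cursor.fetchall():
    print row

cursor.close()
conn.close()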

Regular Expression not working in scraper?

This is a very common problem for beginners who try to write a web crawler / spider / scraper: the content is fetched but the regex does not work right. :( But the problem is not with the regular expression. You just need to add the following two lines after you fetch the content of a web page:

content = content.replace("\n", "")
content = content.replace("\r", "")

Now the regex should work if everything else is ok!
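Here is a small sketch of the situation (the HTML string and the pattern are just illustrations): the dot in a regex does not match newlines by default, so a pattern that spans lines fails until the line breaks are removed (using the re.DOTALL flag is an alternative fix):

import re

content = "<div>\r\n  hello\r\n</div>"

# fails: '.' does not match the newline characters inside the div
print re.findall(r'<div>(.*?)</div>', content)   # -> []

# works after stripping the line breaks, as described above
content = content.replace("\n", "")
content = content.replace("\r", "")
print re.findall(r'<div>(.*?)</div>', content)   # -> ['  hello']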