Posts from March, 2008

Leap Year in Python

Year 2008 is a leap year. The leap year rule is always interesting. Sometimes I also think about the unfortunate people who have their birthday on 29th February! :) Let me write two functions that determine whether a year is a leap year or not (I also wrote a test function for this).

def is_leap_year2(year):
    if year % 400 == 0:
        return True
    elif year % 100 == 0:
        return False
    elif year % 4 == 0:
        return True
    else:
        return False

def is_leap_year(year):
    if year % 100 != 0 and year % 4 == 0:
        return True
    elif year % 100 == 0 and year % 400 == 0:
        return True
    else:
        return False

def test_is_leap_year():
    years = [1900, 2000, 2001, 2002, 2020, 2008, 2010]
    for year in years:
        if is_leap_year2(year):
            print year, "is a leap year"
        else:
            print year, "is not a leap year"

# program starts from here
test_is_leap_year()

After writing the code I found a nice discussion in pytho…
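
By the way, the standard library already ships this check as calendar.isleap, which implements the same divisible-by-4/100/400 rule. A quick sketch running it over the same test list:

import calendar

# calendar.isleap applies the same 4/100/400 rule as the functions above
for year in [1900, 2000, 2001, 2002, 2020, 2008, 2010]:
    print year, calendar.isleap(year)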

Use Beautiful Soup for screen scraping in Python

Tired of writing web spiders/crawlers/scrapers? You can try Beautiful Soup in Python. I have also decided to use it in my next spiders. From their website:

Beautiful Soup is a Python HTML/XML parser designed for quick turnaround projects like screen-scraping. Three features make it powerful:

1. Beautiful Soup won't choke if you give it bad markup. It yields a parse tree that makes approximately as much sense as your original document. This is usually good enough to collect the data you need and run away.

2. Beautiful Soup provides a few simple methods and Pythonic idioms for navigating, searching, and modifying a parse tree: a toolkit for dissecting a document and extracting what you need. You don't have to create a custom parser for each application.

3. Beautiful Soup automatically converts incoming documents to Unicode and outgoing documents to UTF-8. You don't have to think about encodings, unless the document doesn't specify an encoding and Beautiful Soup can't detect one.
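
To give a rough feel for the API, here is a minimal sketch assuming the BeautifulSoup 3 interface that was current at the time (the URL is just a placeholder):

import urllib2
from BeautifulSoup import BeautifulSoup

# fetch a page and parse it, messy markup and all
html = urllib2.urlopen('http://example.com/').read()
soup = BeautifulSoup(html)

# grab the page title and every link on the page
print soup.title.string
for a in soup.findAll('a', href=True):
    print a['href']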

Use Proxy in your Spider

Using proxies, you can minimize the chance of your crawlers/spiders getting blocked. Now let me tell you how to use a proxy IP address in your spider in Python. First load the list from a file:

fileproxylist = open('proxylist.txt', 'r')
proxyList = fileproxylist.readlines()
indexproxy = 0
totalproxy = len(proxyList)

Now for each proxy in the list, call the following function:

import urllib2

def get_source_html_proxy(url, pip):
    proxy_handler = urllib2.ProxyHandler({'http': pip})
    opener = urllib2.build_opener(proxy_handler)
    opener.addheaders = [('User-agent', 'Mozilla/5.0')]
    urllib2.install_opener(opener)
    req = urllib2.Request(url)
    sock = urllib2.urlopen(req)
    data = sock.read()
    return data

Hope your spidering experience will be better with proxies :-) You can use this list to test your crawler.
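
And here is a hedged sketch (my own illustration, not part of the function above) of rotating through the loaded list so one dead proxy doesn't stop the crawl:

def fetch_with_rotation(url, proxies):
    # try each proxy in turn until one of them works
    for pip in proxies:
        pip = pip.strip()  # readlines() leaves a trailing newline
        try:
            return get_source_html_proxy(url, pip)
        except (urllib2.URLError, IOError):
            continue  # proxy dead or blocked, move on to the next
    return None  # every proxy failed

data = fetch_with_rotation('http://example.com/', proxyList)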

Install IDLE in Linux

IDLE is a very nice and common tool for Python programming. On Windows, it comes with the default Python installation, but on Linux it isn't built in. So on Ubuntu I used Synaptic Package Manager to download and install IDLE. Fedora systems come with Python but don't come with IDLE. In fact, if you search for a package or RPM named idle, you won't find one: IDLE is in the python-tools package. So the following command (as root) will install IDLE on Fedora or other Red Hat systems:

yum install python-tools

To download Python: http://python.org/download/ Enjoy your ride on Python!
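
If you prefer the command line on Ubuntu over Synaptic, the equivalent should be (assuming the package is simply named idle, as in the Ubuntu repositories of the time):

sudo apt-get install idle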

Run a Python Script as a Background Process in Linux

So, you have a server to which you connect remotely: you upload a Python script, want to run it, and then log out from the server while keeping the program running. If you frequently work with spiders, you will surely want to do this. But how? For example, if your script's name is script.py, the command is:

nohup python script.py &

And sometimes you may be interested to see the output that is being generated. Then you should view the nohup.out file! This command can be useful:

tail -f nohup.out

So run the program in the background and enjoy your time :-)
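
If you want the output in a file of your own choosing instead of nohup.out, the usual shell redirection works (output.log is just an example name):

nohup python script.py > output.log 2>&1 &
tail -f output.log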

RentACoder.com - another place to look for freelance Python projects

Yes, at rentacoder.com you will find lots of projects, including some Python projects. If you are serious about freelance work, I think it is a good place to start. I have seen lots of people get interested in freelance work, register themselves, and then get frustrated because they can't win any bids. So read the blog post and get some ideas, as the writer there explains everything step by step. Happy bidding!

What is your OS?

Another poll has ended. This was the 2nd poll on this site. I asked my visitors about their operating system. In reply I found that 43% use Windows, 37% Linux, and 12% Mac. Now another poll is running, this time about your browser. Please vote.

Extract Domain Name from URL

Sometimes I need to find the domain name from a URL in my programs for various purposes (most of the time in my crawlers). So far I used the following function, which takes a URL and returns the domain name:

def find_domain(url):
    # skip the leading "http://" (7 characters)
    pos = url[7:].find('/')
    if pos == -1:
        pos = url[7:].find('?')
        if pos == -1:
            return url[7:]
    return url[7:7 + pos]

But today I found a module named urlparse. So my function now looks like this:

from urlparse import urlparse

def find_domain2(url):
    return urlparse(url)[1]

The new one is much better I think. Check urlparse for details.
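
A quick check that the two behave the same (the URL is just a made-up example):

url = 'http://www.example.com/path/page.html?q=1'
print find_domain(url)   # www.example.com
print find_domain2(url)  # www.example.com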