Posts

Showing posts with the label crawler

crawler controller

I work on a project where I have written 20+ crawlers, and the crawlers run 24/7 (with a good amount of sleep). Sometimes I need to update or restart the server, and then I have to start all the crawlers again. So I have written a script that controls all the crawlers: it first checks whether a crawler is already running, and if not, it starts the crawler in the background. I also save the pid of each crawler in a text file so that I can kill a particular crawler immediately when needed. Here is my code:

import shlex
from subprocess import Popen, PIPE

site_dt = {'Site1 Name': ['site1_crawler.py', 'site1_crawler.out'],
           'Site2 Name': ['site2_crawler.py', 'site2_crawler.out']}

location = "/home/crawler/"
pidfp = open('pid.txt', 'w')

def is_running(pname):
    p1 = Popen(["ps", "ax"], stdout=PIPE)
    p2 = Popen(["grep", pname...
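The excerpt cuts off inside is_running, so here is a minimal sketch (Python 3) of how a ps | grep check like this is commonly completed. The SIGPIPE handling and the filtering of grep's own process line are my assumptions, not taken from the post:

```python
from subprocess import Popen, PIPE

def is_running(pname):
    """Return True if `ps ax` lists a process whose command line contains pname."""
    p1 = Popen(["ps", "ax"], stdout=PIPE)
    p2 = Popen(["grep", pname], stdin=p1.stdout, stdout=PIPE)
    p1.stdout.close()  # let ps receive SIGPIPE if grep exits first
    output = p2.communicate()[0]
    # grep matches its own command line too, so drop any line mentioning grep
    lines = [l for l in output.decode().splitlines() if "grep" not in l]
    return len(lines) > 0
```

With a helper like this, the controller can loop over site_dt, skip crawlers that are already up, and Popen the rest, writing each child's pid to the pid file.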

simple web crawler / scraper tutorial using requests module in python

Let me show you how to use the Requests Python module to write a simple web crawler / scraper. First, let's define our problem. On this page: http://cpbook.subeen.com/p/blog-page_11.html, I am publishing some programming problems. Now I shall write a script to get the links (URLs) of the problems. So, let's start. First, make sure you can get the content of the page. For this, write the following code:

import requests

def get_page(url):
    r = requests.get(url)
    print r.status_code
    with open("test.html", "w") as fp:
        fp.write(r.text)

if __name__ == "__main__":
    url = 'http://cpbook.subeen.com/p/blog-page_11.html'
    get_page(url)

Now run the program:

$ python cpbook_crawler.py
200
Traceback (most recent call last)...
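Once the page is downloaded, the post's stated goal is to extract the problem links from it. As a hedged sketch of one way to do that with only the standard library's html.parser (the sample HTML and the example.com URLs below are placeholders, not content from the actual page):

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect the href attribute of every <a> tag fed to the parser."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

# Placeholder HTML standing in for the downloaded test.html content
sample = ('<html><body>'
          '<a href="http://example.com/p1">Problem 1</a>'
          '<a href="http://example.com/p2">Problem 2</a>'
          '</body></html>')

parser = LinkExtractor()
parser.feed(sample)
print(parser.links)  # → ['http://example.com/p1', 'http://example.com/p2']
```

Feeding the saved page's text to LinkExtractor instead of the sample string would yield every link on the page; filtering that list down to just the problem URLs is then a matter of matching their common prefix.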