Crawler controller
I work on a project for which I have written 20+ crawlers that run 24/7 (with a good amount of sleep between runs). Sometimes I need to update or restart the server, and afterwards I have to start all the crawlers again. So I have written a script that controls all the crawlers: it first checks whether a crawler is already running and, if not, starts it in the background. I also save the pid of each crawler in a text file so that I can kill a particular crawler immediately when needed (a sketch of that killing side follows the code).
Here is my code:
import shlex
from subprocess import Popen, PIPE

site_dt = {'Site1 Name': ['site1_crawler.py', 'site1_crawler.out'],
           'Site2 Name': ['site2_crawler.py', 'site2_crawler.out']}

location = "/home/crawler/"
pidfp = open('pid.txt', 'w')

def is_running(pname):
    # Look for the crawler's script path in the process list.
    p1 = Popen(["ps", "ax"], stdout=PIPE)
    p2 = Popen(["grep", pname], stdin=p1.stdout, stdout=PIPE)
    p1.stdout.close()  # Allow p1 to receive a SIGPIPE if p2 exits.
    output = p2.communicate()[0]
    if output.find('/home/crawler/' + pname) > -1:
        return True
    return False

def main():
    for item in site_dt.keys():
        print item
        if is_running(site_dt[item][0]) is True:
            print site_dt[item][0], "already running"
            continue
        cmd = "python " + location + site_dt[item][0] + " -l info"
        outfile = "log/" + site_dt[item][1]
        fp = open(outfile, 'w')
        pid = Popen(shlex.split(cmd), stdout=fp).pid
        print pid
        print
        # pid is an int; convert it before concatenating with strings.
        pidfp.write(item + ": " + str(pid) + "\n")
    pidfp.close()

if __name__ == "__main__":
    main()
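
For reference, here is roughly how the pid file gets used on the killing side. This is a minimal sketch, not part of the controller above: stop_crawler is a hypothetical helper, and it assumes the "Site Name: pid" line format that pidfp.write() produces.

import os
import signal

def stop_crawler(site_name, pidfile='pid.txt'):
    # Hypothetical helper: find the named crawler's pid in pid.txt
    # (one "Site Name: pid" entry per line) and ask it to exit.
    with open(pidfile) as f:
        for line in f:
            name, sep, pid = line.rpartition(': ')
            if sep and name == site_name:
                os.kill(int(pid), signal.SIGTERM)
                return True
    return False

Calling stop_crawler('Site1 Name') then terminates just that crawler without touching the others.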
If you feel that there is scope for improvement, please comment.