A simple web crawler / scraper tutorial using the Requests module in Python

Let me show you how to use the Requests Python module to write a simple web crawler / scraper. So, let's define our problem first.

On this page: http://cpbook.subeen.com/p/blog-page_11.html, I am publishing some programming problems. So, now I shall write a script to get the links (URLs) of the problems.

So, let's start.

First, make sure you can get the content of the page. For this, write the following code:

import requests

def get_page(url):
    r = requests.get(url)
    print r.status_code  # 200 means the page was fetched successfully
    # Save the page so we can inspect it in a browser or text editor.
    with open("test.html", "w") as fp:
        fp.write(r.text)
       
       
if __name__ == "__main__":
    url = 'http://cpbook.subeen.com/p/blog-page_11.html'
    get_page(url)        


Now run the program:

$ python cpbook_crawler.py
200
Traceback (most recent call last):
  File "cpbook_crawler.py", line 15, in <module>
    get_page(url)       
  File "cpbook_crawler.py", line 10, in get_page
    fp.write(r.text)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 1642-1643: ordinal not in range(128)
 


Hmm... we got an error: r.text is a unicode string, and writing it to a file fails when it contains characters outside the ASCII range. It can be fixed using the encode() method. So here is our updated code:

import requests

def get_page(url):
    r = requests.get(url)
    print r.status_code
    content = r.text.encode('utf-8', 'ignore')
    with open("test.html", "w") as fp:
        fp.write(content)
       
       
if __name__ == "__main__":
    url = 'http://cpbook.subeen.com/p/blog-page_11.html'
    get_page(url)           
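The fix works because encode('utf-8', 'ignore') converts the unicode text into a UTF-8 byte string that can be written to a file safely, silently dropping anything it cannot encode. A minimal sketch (the sample string below is made up, not taken from the page):

```python
# A unicode string containing one non-ASCII character (e with acute accent).
text = u'caf\u00e9'

# Encoding produces a UTF-8 byte string that a plain file write accepts.
encoded = text.encode('utf-8', 'ignore')

# The accented character takes two bytes in UTF-8, so we get 5 bytes in total.
print(len(encoded))  # 5
```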

Now you can open the test.html file in a browser, and also in a text editor, to check it. After a close look at the file, you can write a regular expression to extract the body of the blog post. We are interested only in the part that contains the links. So here is our code now:

import re
import requests

def get_page(url):
    r = requests.get(url)
    content = r.text.encode('utf-8', 'ignore')
    return content
       
       
if __name__ == "__main__":
    url = 'http://cpbook.subeen.com/p/blog-page_11.html'
    content = get_page(url)
    content_pattern = re.compile(r'<h3 class="\" entry-title="entry-title" post-title="post-title"> (.*?)<div class="\" post-footer="post-footer"> ')
    result = re.findall(content_pattern, content)
    print result


Now run the script:

$ python cpbook_crawler.py
[]

We got an empty list, so the regular expression didn't match. The reason is that `.` in a regular expression does not match new-line characters by default, and the part we want spans several lines. We can get around this by removing the new-line characters from the content.

content = content.replace("\n", '')

You should add this line and run the program again.
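A tiny made-up example shows the effect, and also shows the re.DOTALL flag, which makes `.` match new-lines and is an alternative to stripping them:

```python
import re

# A made-up snippet where the content we want spans several lines.
html = '<h3>\ntitle\n</h3>'

# By default '.' does not match '\n', so nothing is found.
print(re.findall(r'<h3>(.*?)</h3>', html))  # []

# Option 1: remove the new-lines first, as we do in the script.
print(re.findall(r'<h3>(.*?)</h3>', html.replace('\n', '')))  # ['title']

# Option 2: re.DOTALL makes '.' match new-lines too (they stay in the result).
print(re.findall(r'<h3>(.*?)</h3>', html, re.DOTALL))  # ['\ntitle\n']
```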

Now we write the regular expression to get the list of the URLs. And here is our code:

import re
import requests

if __name__ == "__main__":
    url = 'http://cpbook.subeen.com/p/blog-page_11.html'
    content = get_page(url)
    content = content.replace("\n", '')
   
    content_pattern = re.compile(r'<h3 class="\" entry-title="entry-title" post-title="post-title"> (.*?)<div class="\" post-footer="post-footer"> ')
    result = re.findall(content_pattern, content)
    data = result[0]
   
    url_pattern = re.compile(r'<a href="http://www.blogger.com/(.*?)">')
    result = re.findall(url_pattern, data)   
    print result


Now run this (don't forget to write the get_page() function):

$ python cpbook_crawler.py
['http://cpbook.subeen.com/2012/11/blog-post.html', 'http://cpbook.subeen.com/2012/11/positive-negative.html', 'http://cpbook.subeen.com/2012/11/count-numbers.html', 'http://cpbook.subeen.com/2012/11/rectangle-1.html', 'http://cpbook.subeen.com/2012/11/ascii-add.html', 'http://cpbook.subeen.com/2012/11/maximum-minimum-number.html', 'http://cpbook.subeen.com/2012/11/square-number.html', 'http://cpbook.subeen.com/2012/11/average-1.html', 'http://cpbook.subeen.com/2012/11/average-2.html', 'http://cpbook.subeen.com/2012/11/prime-number.html', 'http://cpbook.subeen.com/2012/11/age-calculation.html', 'http://cpbook.subeen.com/2012/11/count-digits.html', 'http://cpbook.subeen.com/2012/11/left-right.html', 'http://cpbook.subeen.com/2012/11/even-odd-1.html', 'http://cpbook.subeen.com/2012/11/even-odd-2.html', 'http://cpbook.subeen.com/2012/11/square-box-1.html', 'http://cpbook.subeen.com/2012/12/common-digits.html', 'http://vimeo.com/54188390', 'http://vimeo.com/user13634479', 'http://vimeo.com']


But we don't want the Vimeo URLs, so we need to tighten the regular expression:

url_pattern = re.compile(r'<a href="http://www.blogger.com/(http://cpbook.subeen.com/.*?)">')
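Before wiring this into the script, we can sanity-check the tightened pattern offline. The sample string below is a hypothetical fragment in the same shape as the new-line-stripped page markup, not copied from the real page:

```python
import re

# Hypothetical sample: one cpbook link and one vimeo link, in page-markup shape.
data = ('<a href="http://www.blogger.com/http://cpbook.subeen.com/2012/11/blog-post.html">'
        '<a href="http://www.blogger.com/http://vimeo.com/54188390">')

url_pattern = re.compile(r'<a href="http://www.blogger.com/(http://cpbook.subeen.com/.*?)">')

# Only the cpbook.subeen.com link survives; the vimeo one no longer matches.
print(re.findall(url_pattern, data))  # ['http://cpbook.subeen.com/2012/11/blog-post.html']
```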

So, our final code is:

import re
import requests

def get_page(url):
    r = requests.get(url)
    content = r.text.encode('utf-8', 'ignore')
    return content
       
       
if __name__ == "__main__":
    url = 'http://cpbook.subeen.com/p/blog-page_11.html'
    content = get_page(url)
    content = content.replace("\n", '')
   
    content_pattern = re.compile(r'<h3 class="\" entry-title="entry-title" post-title="post-title"> (.*?)<div class="\" post-footer="post-footer"> ')
    result = re.findall(content_pattern, content)
    data = result[0]
   
    url_pattern = re.compile(r'<a href="http://www.blogger.com/(http://cpbook.subeen.com/.*?)">')
    problem_list = re.findall(url_pattern, data)   
    print problem_list
   

Run the program:

$ python cpbook_crawler.py
['http://cpbook.subeen.com/2012/11/blog-post.html', 'http://cpbook.subeen.com/2012/11/positive-negative.html', 'http://cpbook.subeen.com/2012/11/count-numbers.html', 'http://cpbook.subeen.com/2012/11/rectangle-1.html', 'http://cpbook.subeen.com/2012/11/ascii-add.html', 'http://cpbook.subeen.com/2012/11/maximum-minimum-number.html', 'http://cpbook.subeen.com/2012/11/square-number.html', 'http://cpbook.subeen.com/2012/11/average-1.html', 'http://cpbook.subeen.com/2012/11/average-2.html', 'http://cpbook.subeen.com/2012/11/prime-number.html', 'http://cpbook.subeen.com/2012/11/age-calculation.html', 'http://cpbook.subeen.com/2012/11/count-digits.html', 'http://cpbook.subeen.com/2012/11/left-right.html', 'http://cpbook.subeen.com/2012/11/even-odd-1.html', 'http://cpbook.subeen.com/2012/11/even-odd-2.html', 'http://cpbook.subeen.com/2012/11/square-box-1.html', 'http://cpbook.subeen.com/2012/12/common-digits.html']


So we are done!

Comments

Unknown said…
You could use \s+ in order to match new lines and spaces
