
Scrape Macy's deals using Beautiful Soup

Let me show you a tiny real example of how to use the bs4 (Beautiful Soup version 4) module in Python. Say we want to collect information about the hot deals from Macy's. Well, you could view all the info on one page and copy-paste it, but that's not our purpose. First, get the content of the page using the cute requests module:

```python
import requests

url = 'http://bit.ly/19zWmQT'
r = requests.get(url)
html_content = r.text
```

Now start cooking the soup:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(html_content, 'html.parser')
```

Now look at the HTML (page source) of the URL. You will see that the offers are in list items (li) that have 'offer' as a CSS class name (along with some other class names). So you can write the code in the following way:

```python
offer_list = soup('li', 'offer')
```

Or you can write:

```python
offer_list = soup.find_all('li', 'offer')
```

Another way to write this is:

```python
offer_list = soup.select('li.offer')
```

...
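Whichever form you use, you get back a list of Tag objects you can pull text out of. Here is a minimal, self-contained sketch; the HTML string below is made up for illustration (it only mimics the li/class structure described above, not the real Macy's page):

```python
from bs4 import BeautifulSoup

# Hypothetical markup mimicking the structure described above
html_content = """
<ul>
  <li class="offer special">Extra 20% off</li>
  <li class="offer">Free shipping over $99</li>
  <li class="ad">Not an offer</li>
</ul>
"""

soup = BeautifulSoup(html_content, 'html.parser')

# All three forms select the same two <li class="offer"> elements
a = soup('li', 'offer')
b = soup.find_all('li', 'offer')
c = soup.select('li.offer')

# Extract the visible text of each offer
texts = [li.get_text(strip=True) for li in a]
print(texts)  # ['Extra 20% off', 'Free shipping over $99']
```

Note that find_all matches any element whose class attribute contains 'offer', so `<li class="offer special">` is matched too.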

Use Beautiful Soup for screen scraping in Python

Tired of writing web spiders/crawlers/scrapers? You can try Beautiful Soup in Python. I have also decided to use it in my next spiders. From their website:

Beautiful Soup is a Python HTML/XML parser designed for quick turnaround projects like screen-scraping. Three features make it powerful:

1. Beautiful Soup won't choke if you give it bad markup. It yields a parse tree that makes approximately as much sense as your original document. This is usually good enough to collect the data you need and run away.
2. Beautiful Soup provides a few simple methods and Pythonic idioms for navigating, searching, and modifying a parse tree: a toolkit for dissecting a document and extracting what you need. You don't have to create a custom parser for each application.
3. Beautiful Soup automatically converts incoming documents to Unicode and outgoing documents to UTF-8. You don't have to think about encodings, unless the document doesn't specify an encoding and Beautiful Soup c...