1. Using Beautiful Soup.
You can find code to retrieve the links from a web page using Beautiful Soup in its documentation.
The code is simple:
import BeautifulSoup
import urllib2

print "Enter the URL: "
url = raw_input("> ")
usock = urllib2.urlopen(url)
html_source = usock.read()
usock.close()
# Parse only the <a> tags instead of the whole document.
links = BeautifulSoup.SoupStrainer('a')
for link in BeautifulSoup.BeautifulSoup(html_source, parseOnlyThese=links):
    if link.has_key('href'):
        print link['href']
2. I saw another piece of code in the book Dive Into Python that uses an HTML parser to extract URLs.
from sgmllib import SGMLParser

class URLLister(SGMLParser):
    def reset(self):
        SGMLParser.reset(self)
        self.urls = []

    def start_a(self, attrs):
        href = [v for k, v in attrs if k=='href']
        if href:
            self.urls.extend(href)

if __name__ == "__main__":
    import urllib2
    print "Enter the URL: "
    url = raw_input("> ")
    usock = urllib2.urlopen(url)
    parser = URLLister()
    parser.feed(usock.read())
    parser.close()
    usock.close()
    for url in parser.urls:
        print url
Now try this URL with the two programs above: http://www.vworker.com
With the first one you will get only a few of the links (not all), and with the second one you will get an error message. :(
The reason is that there is an incorrect tag in the HTML source code (search the source for the term 'strong').
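As an aside for readers on Python 3, where sgmllib is no longer available: the URLLister idea can be sketched with the standard library's html.parser, which generally does not raise on stray or unclosed tags. This is only a minimal sketch. The class name LinkLister and the sample markup are made up for illustration, and the sample is not the real source of the page mentioned above.

```python
from html.parser import HTMLParser


class LinkLister(HTMLParser):
    """Collect the href values of <a> tags, in the spirit of URLLister."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag and names are lowercased.
        if tag == "a":
            self.urls.extend(v for k, v in attrs if k == "href" and v)


# Made-up markup with an unclosed <strong> and a stray </b> (illustration only).
sample = (
    '<p><strong>Menu <a href="/jobs">Jobs</a></b>'
    "<a href='/about'>About</a></p>"
)

parser = LinkLister()
parser.feed(sample)
parser.close()
print(parser.urls)
```

Whether this copes with a particular broken page still has to be checked against that page's actual source.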
3. So I had to write this Python script using a regular expression.
import re
import urllib2

def get_hyperlinks(url, source):
    # Drop a trailing slash so relative links can be joined consistently.
    if url.endswith("/"):
        url = url[:-1]
    urlPat = re.compile(r'<a [^<>]*?href=("|\')([^<>"\']*?)("|\')')
    result = re.findall(urlPat, source)
    urlList = []
    for item in result:
        link = item[1]  # the second group holds the href value
        if link.startswith(url):
            pass  # absolute link on the same site: keep it as it is
        elif link.startswith("http://") or link.startswith("https://"):
            continue  # absolute link to another site: skip it
        elif link.startswith("/"):
            link = url + link
        else:
            link = url + "/" + link
        # Add each link only once.
        if link not in urlList:
            urlList.append(link)
    return urlList

print "Enter the URL: "
url = raw_input("> ")
usock = urllib2.urlopen(url)
data = usock.read()
usock.close()
print get_hyperlinks(url, data)
This code also makes sure that there are no duplicate links. Now test it with http://www.vworker.com :)
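For Python 3, the same approach might look roughly like the sketch below. It keeps the regular expression from the script above but leaves the joining of relative links to urllib.parse.urljoin, and it keeps only links on the same host. The function name get_hyperlinks_py3, the example.com addresses and the sample markup are assumptions made for illustration, so the block can run without fetching anything.

```python
import re
from urllib.parse import urljoin, urlparse

# Same pattern as in the script above.
URL_PAT = re.compile(r'<a [^<>]*?href=("|\')([^<>"\']*?)("|\')')


def get_hyperlinks_py3(url, source):
    """Return the unique same-site links in source, resolved against url."""
    base = url if url.endswith("/") else url + "/"
    site = urlparse(base).netloc
    url_list = []
    for _open_quote, href, _close_quote in URL_PAT.findall(source):
        link = urljoin(base, href)  # resolves "/x", "x" and absolute links
        if urlparse(link).netloc != site:
            continue  # the link points to another site
        if link not in url_list:
            url_list.append(link)
    return url_list


# Made-up markup with an unclosed <strong> tag (illustration only).
sample = (
    '<p><strong>Menu <a href="/jobs">Jobs</a>'
    "<a class='x' href='about.html'>About</a>"
    '<a href="http://www.example.com/jobs">Jobs again</a>'
    '<a href="http://other.example.org/">Elsewhere</a></p>'
)

links = get_hyperlinks_py3("http://www.example.com", sample)
print(links)
```

To run it against a live page, the source string could come from urllib.request.urlopen(url).read().decode(...), with the encoding chosen to match the page.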
5 comments:
I was trying to build a page scraper myself, but this code keeps giving me urllib errors.
File "C:\Python26\lib\urllib2.py", line 244, in get_type
raise ValueError, "unknown url type: %s" % self.__original
ValueError: unknown url type:
Very strange.
Also, the SGMLParser solution offered by Mark Pilgrim is deprecated in newer versions of Python.
Hi, I read the post. It's nice code for retrieving the links of a given URL. But some of the code is missing.
The url pattern should be:
urlPat = re.compile(r'[<>]*?href=("|\')([^<>"\']*?)("|\')')
I guess you forgot the bracket [
Thanks for the post.
Actually my regular expression is:
urlPat = re.compile(r'<a [^<>]*?href=("|\')([^<>"\']*?)("|\')')
I'll try the new one, thanks again.
I am using PyQuery. Very, very simple.