python code to retrieve links from a web page

There are several ways to extract / retrieve links (URLs) from a web page using Python. Let me discuss a few of them.

1. Using Beautiful Soup.
You can find code to retrieve links from a web page using Beautiful Soup in its documentation.
The code is simple:

import urllib2
from BeautifulSoup import BeautifulSoup, SoupStrainer

print "Enter the URL: "
url = raw_input("> ")

usock = urllib2.urlopen(url)
html_source = usock.read()
usock.close()

# parse only the <a> tags, then print the href attribute of each
links = SoupStrainer('a')
for link in BeautifulSoup(html_source, parseOnlyThese=links):
    if link.has_key('href'):
        print link['href']
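The code above is for the old BeautifulSoup 3 on Python 2. On Python 3 the same idea uses the bs4 package (pip install beautifulsoup4), where `parseOnlyThese` became `parse_only`. A minimal offline sketch (the sample HTML here is made up for illustration):

```python
from bs4 import BeautifulSoup, SoupStrainer

html_source = """
<html><body>
<a href="http://example.com/a">A</a>
<a name="anchor-without-href">no href</a>
<a href="/relative">B</a>
</body></html>
"""

# SoupStrainer tells the parser to build a tree containing only <a> tags
only_a_tags = SoupStrainer('a')
soup = BeautifulSoup(html_source, 'html.parser', parse_only=only_a_tags)

# keep only the tags that actually carry an href attribute
links = [tag['href'] for tag in soup.find_all('a') if tag.has_attr('href')]
print(links)  # ['http://example.com/a', '/relative']
```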


2. I saw another piece of code in the book Dive Into Python that uses an HTML parser to extract URLs.

from sgmllib import SGMLParser

class URLLister(SGMLParser):
    def reset(self):
        SGMLParser.reset(self)
        self.urls = []

    def start_a(self, attrs):
        href = [v for k, v in attrs if k == 'href']
        if href:
            self.urls.extend(href)

if __name__ == "__main__":
    import urllib2
    print "Enter the URL: "
    url = raw_input("> ")

    usock = urllib2.urlopen(url)
    parser = URLLister()
    parser.feed(usock.read())
    parser.close()
    usock.close()

    for url in parser.urls:
        print url
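The sgmllib module was removed in Python 3; its standard-library replacement is html.parser.HTMLParser, which is also fairly tolerant of malformed markup. A sketch of the same URLLister idea on Python 3 (the sample HTML is made up):

```python
from html.parser import HTMLParser

class URLLister(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.urls = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs, much like in sgmllib
        if tag == 'a':
            self.urls.extend(v for k, v in attrs if k == 'href')

parser = URLLister()
parser.feed('<a href="http://example.com/">x</a> <a href="/page">y</a>')
parser.close()
print(parser.urls)  # ['http://example.com/', '/page']
```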

Now try this URL with the above two programs: http://www.vworker.com
The first one will give you only a few links (not all), and the second one will give you an error message. :(
The reason is that there is a malformed tag in the HTML source code (search the source for the term 'strong').

3. So I had to write this Python script using regular expressions.

import re
import urllib2

def get_hyperlinks(url, source):
    if url.endswith("/"):
        url = url[:-1]

    urlPat = re.compile(r'<a [^<>]*?href=("|\')([^<>"\']*?)("|\')')

    result = re.findall(urlPat, source)

    urlList = []

    for item in result:
        link = item[1]
        if link.startswith("http://"):
            # keep only absolute links that belong to the same site
            if link.startswith(url) and link not in urlList:
                urlList.append(link)
        elif link.startswith("/"):
            link = url + link
            if link not in urlList:
                urlList.append(link)
        else:
            link = url + "/" + link
            if link not in urlList:
                urlList.append(link)

    return urlList

print "Enter the URL: "
url = raw_input("> ")

usock = urllib2.urlopen(url)
data = usock.read()
usock.close()
print get_hyperlinks(url, data)

This code also makes sure there are no duplicate links. Now test it with http://www.vworker.com :)
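To see what the pattern actually captures without fetching a page, here is a small offline sketch in Python 3 (the sample HTML and base URL are made up for illustration). It also shows urllib.parse.urljoin, which is a sturdier alternative to the manual string concatenation in get_hyperlinks, since it handles '/', './', '../' and absolute URLs correctly:

```python
import re
from urllib.parse import urljoin

urlPat = re.compile(r'<a [^<>]*?href=("|\')([^<>"\']*?)("|\')')

sample = ('<a class="nav" href="/about">About</a> '
          "<a href='http://example.com/blog/post.html'>Post</a>")

# findall returns one tuple per match; the URL itself is the second group
links = [m[1] for m in re.findall(urlPat, sample)]
print(links)  # ['/about', 'http://example.com/blog/post.html']

# urljoin resolves a relative link against a base and leaves absolute links alone
base = 'http://www.example.com/'
print([urljoin(base, link) for link in links])
```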

Comments

Josh English said…
I was trying to build a page scraper myself, but this code keeps giving me urllib errors.

File "C:\Python26\lib\urllib2.py", line 244, in get_type
raise ValueError, "unknown url type: %s" % self.__original
ValueError: unknown url type:

Very strange.

Also, the SGMLParser solution offered by Mark Pilgrim is deprecated in newer versions of Python.
Unknown said…
Hi, I read the post. It's a nice piece of code for retrieving links from a given URL. But some code is missing.

The url pattern should be:

urlPat = re.compile(r'[<>]*?href=("|\')([^<>"\']*?)("|\')')

I guess you forgot the bracket [

Thanks for the post.
Tamim Shahriar said…
Actually my regular expression is:

urlPat = re.compile(r'<a [^<>]*?href=("|\')([^<>"\']*?)("|\')')
Unknown said…
I'll try new one, thanks again.
Unknown said…
I am using PyQuery. Very, very simple.
