Python code to retrieve links from a web page

September 03, 2010

There are several ways to extract (retrieve) links (URLs) from a web page using Python. Let me discuss a few of them.

1. Using Beautiful Soup.
You can find code to retrieve links from a web page using Beautiful Soup in its documentation.
The code is simple:


import BeautifulSoup
import urllib2

print "Enter the URL: "
url = raw_input("> ")
usock = urllib2.urlopen(url)
html_source = usock.read()
usock.close()

# Parse only the <a> tags, then print every href that is present.
links = BeautifulSoup.SoupStrainer('a')
for link in BeautifulSoup.BeautifulSoup(html_source, parseOnlyThese=links):
    if link.has_key('href'):
        print link['href']

2. I saw another piece of code in the book Dive Into Python that uses an HTML parser to extract URLs.


from sgmllib import SGMLParser

class URLLister(SGMLParser):
    def reset(self):
        SGMLParser.reset(self)
        self.urls = []

    def start_a(self, attrs):
        href = [v for k, v in attrs if k=='href']
        if href:
            self.urls.extend(href)

if __name__ == "__main__":
    import urllib2
    print "Enter the URL: "
    url = raw_input("> ")

    usock = urllib2.urlopen(url)
    parser = URLLister()
    parser.feed(usock.read())
    parser.close()
    usock.close()

    for url in parser.urls:
        print url

Now try both of the above programs with this URL: http://www.vworker.com
The first one returns only a few of the links (not all), and the second one fails with an error message. :(
The reason is an incorrect tag in the HTML source code (search the source code for the term 'strong').
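For readers on Python 3, where sgmllib is no longer available, the same URLLister idea can be sketched with html.parser.HTMLParser from the standard library. This is only a rough equivalent, not the code from the book: the helper name list_urls and the sample markup are made up for illustration. HTMLParser reports tags as it meets them and does not check that they are balanced, so a stray tag on its own should not stop the link collection.

```python
from html.parser import HTMLParser


class URLLister(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag names arrive lowercased.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.urls.append(value)


def list_urls(html_text):
    """Return all href values found in html_text, in document order.

    list_urls is a hypothetical helper name used only for this sketch.
    """
    parser = URLLister()
    parser.feed(html_text)
    parser.close()
    return parser.urls


if __name__ == "__main__":
    # Invented sample with an unclosed <strong> tag.
    sample = ('<p><strong><a href="/jobs">Jobs</a></p>'
              '<a href="http://example.com/">Home</a>')
    for found in list_urls(sample):
        print(found)
```

The links come back exactly as written in the page, so relative ones such as /jobs still have to be joined with the site address.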

3. So I had to write this Python script using a regular expression.


import re
import urllib2

def get_hyperlinks(url, source):
    # Drop a trailing slash so that joined links do not contain "//".
    if url.endswith("/"):
        url = url[:-1]

    # Group 2 captures the href value of each <a ...> tag.
    urlPat = re.compile(r'<a [^<>]*?href=("|\')([^<>"\']*?)("|\')')

    result = re.findall(urlPat, source)

    urlList = []

    for item in result:
        link = item[1]
        if link.startswith("http://") or link.startswith("https://"):
            # Absolute link: keep it only if it belongs to the same site.
            if not link.startswith(url):
                continue
        elif link.startswith("/"):
            link = url + link
        else:
            link = url + "/" + link
        # Skip links that were already collected.
        if link not in urlList:
            urlList.append(link)

    return urlList

print "Enter the URL: "
url = raw_input("> ")
usock = urllib2.urlopen(url)
data = usock.read()
usock.close()
print get_hyperlinks(url, data)

This code also makes sure that there are no duplicate links. Now test this with http://www.vworker.com :)
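If you are on Python 3, the same idea might look roughly like the sketch below. It keeps the regular expression from the script above but swaps the manual string joining for urllib.parse.urljoin, and it compares host names to decide whether a link belongs to the same site. Both of those are my substitutions, not part of the original script, and the sample HTML and example.com addresses are invented. The page source itself can be read with urllib.request.urlopen in place of urllib2.urlopen.

```python
import re
from urllib.parse import urljoin, urlparse

# Same pattern as in the script above: group 2 is the href value.
URL_PAT = re.compile(r'<a [^<>]*?href=("|\')([^<>"\']*?)("|\')')


def get_hyperlinks(url, source):
    """Return the unique same-site links in source, resolved against url."""
    base_host = urlparse(url).netloc
    url_list = []
    for _, link, _ in URL_PAT.findall(source):
        # urljoin resolves "/path", "page.html" and "../x" against the base.
        absolute = urljoin(url, link)
        if urlparse(absolute).netloc != base_host:
            continue  # skip links that point to another site
        if absolute not in url_list:
            url_list.append(absolute)
    return url_list


if __name__ == "__main__":
    # Invented sample page: one duplicate and one external link.
    sample = ('<a href="/a">A</a> <a class="x" href=\'b.html\'>B</a> '
              '<a href="http://other.example/c">C</a> <a href="/a">A again</a>')
    print(get_hyperlinks("http://www.example.com/", sample))
```

Comparing host names rather than string prefixes means that a link written with a different path prefix on the same site is still kept, while mailto: links and links to other sites are dropped.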

Comments

Josh English said…

I was trying to build a page scraper myself, but this code keeps giving me urllib errors.

File "C:\Python26\lib\urllib2.py", line 244, in get_type
raise ValueError, "unknown url type: %s" % self.__original
ValueError: unknown url type:

Very strange.

Also, the SGMLParser solution offered by Mark Pilgrim is deprecated in newer versions of Python.

September 3, 2010 at 4:16 AM

Unknown said…

Hi, I read the post. It's nice code for retrieving the links of a given URL. But some code is missing.

The url pattern should be:

urlPat = re.compile(r'[<>]*?href=("|\')([^<>"\']*?)("|\')')

I guess you forgot the bracket [

Thanks for the post.

September 3, 2010 at 6:33 AM

Tamim Shahriar said…

Actually my regular expression is:

urlPat = re.compile(r'<a [^<>]*?href=("|\')([^<>"\']*?)("|\')')

September 3, 2010 at 7:47 AM

Unknown said…

I'll try new one, thanks again.

September 3, 2010 at 8:13 AM

Unknown said…

I am using PyQuery. Very, very simple.

February 16, 2011 at 7:05 PM