Python code to retrieve links from a web page
There are several ways to extract / retrieve links (URLs) from a web page using Python. Let me discuss a few ways.
1. Using Beautiful Soup.
You can find code to retrieve links from a web page using Beautiful Soup in its documentation.
The code is simple:
import BeautifulSoup
import urllib2

print "Enter the URL: "
url = raw_input("> ")
usock = urllib2.urlopen(url)
html_source = usock.read()  # read the page once
usock.close()
links = BeautifulSoup.SoupStrainer('a')  # parse only the <a> tags
for link in BeautifulSoup.BeautifulSoup(html_source, parseOnlyThese=links):
    if link.has_key('href'):
        print link['href']
2. I saw another piece of code in the book Dive Into Python that uses an HTML parser to extract URLs.
from sgmllib import SGMLParser

class URLLister(SGMLParser):
    def reset(self):
        SGMLParser.reset(self)
        self.urls = []

    def start_a(self, attrs):
        href = [v for k, v in attrs if k == 'href']
        if href:
            self.urls.extend(href)

if __name__ == "__main__":
    import urllib2
    print "Enter the URL: "
    url = raw_input("> ")
    usock = urllib2.urlopen(url)
    parser = URLLister()
    parser.feed(usock.read())
    parser.close()
    usock.close()
    for url in parser.urls:
        print url
Now try this URL with the above two programs: http://www.vworker.com
With the first one you will get only a few links (not all), and with the second one you will get an error message. :(
The reason is that there is a malformed tag in the HTML source (search the source for the term 'strong').
3. So I had to write this Python script using a regular expression.
import re
import urllib2

def get_hyperlinks(url, source):
    if url.endswith("/"):
        url = url[:-1]
    urlPat = re.compile(r'<a [^<>]*?href=("|\')([^<>"\']*?)("|\')')
    result = re.findall(urlPat, source)
    urlList = []
    for item in result:
        link = item[1]
        if link.startswith("http://"):
            pass  # already an absolute URL, keep as is
        elif link.startswith("/"):
            link = url + link  # root-relative link
        else:
            link = url + "/" + link  # document-relative link
        if link not in urlList:  # skip duplicates
            urlList.append(link)
    return urlList

print "Enter the URL: "
url = raw_input("> ")
usock = urllib2.urlopen(url)
data = usock.read()
usock.close()
print get_hyperlinks(url, data)
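For quick testing, the same pattern and the same normalization steps can be exercised in isolation without fetching a live page (Python 3 syntax here; the sample HTML and the base URL are made up for illustration):

```python
import re

# the article's pattern: an <a> tag whose href value is quoted with " or '
urlPat = re.compile(r'<a [^<>]*?href=("|\')([^<>"\']*?)("|\')')

sample = ('<a href="http://example.com/x">absolute</a>'
          '<a class="nav" href="/about">root-relative</a>'
          "<a href='faq.html'>relative</a>")

base = "http://example.com"
links = []
for _, link, _ in urlPat.findall(sample):
    if link.startswith("http://"):
        pass                      # already absolute
    elif link.startswith("/"):
        link = base + link        # root-relative
    else:
        link = base + "/" + link  # document-relative
    if link not in links:         # no duplicates
        links.append(link)

print(links)
# ['http://example.com/x', 'http://example.com/about', 'http://example.com/faq.html']
```

Because it never builds a parse tree, this approach shrugs off the broken markup that trips up the parser-based versions.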
This code also makes sure there are no duplicate links. Now test it with http://www.vworker.com :)
Comments
File "C:\Python26\lib\urllib2.py", line 244, in get_type
raise ValueError, "unknown url type: %s" % self.__original
ValueError: unknown url type:
Very strange.
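For what it's worth, urllib2 raises that ValueError when the URL has no scheme, e.g. typing www.example.com instead of http://www.example.com. A one-line guard before urlopen avoids it (sketch; the sample input is my guess at what was typed):

```python
url = "www.example.com"   # a scheme-less URL like the one that likely caused the error
if "://" not in url:      # no scheme present, urlopen would reject this
    url = "http://" + url
print(url)  # http://www.example.com
```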
Also, the SGMLParser solution offered by Mark Pilgrim is deprecated in newer versions of Python.
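Indeed, sgmllib was later removed entirely in Python 3. A minimal sketch of the same URLLister idea on top of the standard library's html.parser (the class and variable names are mine, not from the book):

```python
from html.parser import HTMLParser

class URLLister(HTMLParser):
    """Collects the href of every <a> tag fed to the parser."""
    def __init__(self):
        HTMLParser.__init__(self)
        self.urls = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag
        if tag == "a":
            self.urls.extend(v for k, v in attrs if k == "href")

parser = URLLister()
parser.feed('<p><a href="/a">one</a> <a href="http://x.com/b">two</a></p>')
parser.close()
print(parser.urls)  # ['/a', 'http://x.com/b']
```

html.parser is also more tolerant of malformed markup than sgmllib was, so a stray tag like the one on vworker.com is less likely to abort the whole run.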
The url pattern should be:
urlPat = re.compile(r'[<>]*?href=("|\')([^<>"\']*?)("|\')')
I guess you forgot the bracket [
Thanks for the post.
urlPat = re.compile(r'<a [^<>]*?href=("|\')([^<>"\']*?)("|\')')