Updated Python code for getting HTML source

Yesterday I made a small update to my function get_html_source(), which gets the content of a page. I did so because I found that my previous function didn't support HTTP POST. Now the code supports both HTTP GET and HTTP POST. It also returns the cookiejar along with the HTML content of the page.
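As a rough sketch of how the arguments might be prepared before calling the function: data is expected to be a URL-encoded form string, and cj a cookie jar object. The field names and values below are made up for illustration, and the imports fall back to their Python 3 locations so the snippet runs on either version.

```python
try:  # Python 2, matching the code in this post
    from urllib import urlencode
    from cookielib import CookieJar
except ImportError:  # Python 3 equivalents
    from urllib.parse import urlencode
    from http.cookiejar import CookieJar

# Hypothetical form fields; a real form will use its own field names.
form_fields = [('username', 'demo'), ('password', 'secret word')]

# urlencode() builds the application/x-www-form-urlencoded body for a POST.
post_data = urlencode(form_fields)

# Reusing one jar across calls keeps session cookies between requests.
cookie_jar = CookieJar()

# An empty jar has length 0 and is therefore falsy, which matters for
# any truthiness check such as "if cj:".
jar_is_truthy = bool(cookie_jar)
```

The call itself would then look something like content, cookie_jar = get_html_source(url, referer, post_data, cookie_jar), with the returned jar passed on to the next request.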

import urllib2

def get_html_source(url, referer='', data=None, cj=None, retry_counter=0):
    # Returns (content, cj); gives up with ('', '') after repeated failures.
    if retry_counter > 0:
        print 'Trying Again...'
    if retry_counter > 3:
        print 'Could not get source from url:', url
        return '', ''
    try:
        # An empty CookieJar is falsy, so compare with None instead of "if cj:".
        if cj is not None:
            opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
        else:
            opener = urllib2.build_opener()

        opener.addheaders = [('Referer', referer),
                             ('Content-Type', 'application/x-www-form-urlencoded'),
                             ('User-Agent', 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.14) Gecko/20080404 Firefox/2.0.0.14'),
                             ('Accept-Encoding', 'gzip,deflate')]

        if data:
            # HTTP POST
            usock = opener.open(url, data)
        else:
            # HTTP GET
            usock = opener.open(url)

        content = decode(usock)  # I think I have already written the code of decode function
                                 # in another post. If you can't find it, just leave a comment
                                 # here and I shall post the code again.
        usock.close()
        return content, cj
    except urllib2.HTTPError, e:
        # HTTPError is a subclass of URLError, so it must be caught first.
        print 'The server couldn\'t fulfill the request for url:', url
        print 'Error code:', e.code
        return get_html_source(url, referer, data, cj, retry_counter + 1)
    except urllib2.URLError, e:
        print 'We failed to reach a server.'
        print 'Reason:', e.reason
        return get_html_source(url, referer, data, cj, retry_counter + 1)
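The decode() function called above lives in another post and is not reproduced here. Since the request advertises Accept-Encoding: gzip,deflate, a stand-in would need to undo whichever compression the server chose. The following is only a minimal sketch of what such a helper might look like, not the original function; it assumes the response object offers read() and info(), as urllib2 responses do, and it returns the raw bytes without any character-set handling.

```python
import gzip
import io
import zlib

def decode(response):
    """Hypothetical stand-in for the decode() helper from the other post."""
    raw = response.read()
    encoding = (response.info().get('Content-Encoding') or '').lower()
    if encoding == 'gzip':
        # Wrap the bytes in a file-like object so GzipFile can read them.
        return gzip.GzipFile(fileobj=io.BytesIO(raw)).read()
    if encoding == 'deflate':
        try:
            # Standard case: a zlib-wrapped deflate stream.
            return zlib.decompress(raw)
        except zlib.error:
            # Some servers send a raw deflate stream without the zlib header.
            return zlib.decompress(raw, -zlib.MAX_WBITS)
    # No compression (or an unknown encoding): hand back the body unchanged.
    return raw
```

The sketch only uses modules from the standard library and should behave the same on Python 2.6+ and Python 3.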



Please suggest any updates or modifications this code needs.

Comments

Andrey said…
The error handling code should be like

...
return get_html_source(url, referer, data, cj, retry_counter + 1)
...

not just get_source. Is it so?
Tamim Shahriar said…
Andrey, you are right. Thanks for pointing it out.
Juanmi said…
Hi, I don't find the decode function, could you post it again?

Thanks for your nice examples.
Tamim Shahriar said…
Juan, please check this post.
Juanmi said…
I am blind :S Thanks
Robert said…
SyntaxError: invalid syntax
