Get Original URL
I once ran into trouble while crawling some websites. Some of the URLs I had weren't the original URLs; instead, they redirected to other URLs. So I wrote a function to get the original URL. Here it is:
Please send me your comments on this piece of code.
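import cookielib
import urllib2

def get_original_url(url):
    """Take a URL, follow any redirects, and return the original (final) URL
    together with the cookie jar (which holds cookies, if any were set).
    """
    cj = cookielib.CookieJar()
    # The opener follows redirects and stores any cookies it receives in cj.
    opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
    # Send a browser-like user agent instead of urllib2's default.
    opener.addheaders = [('User-agent', 'Mozilla/5.0')]
    usock = opener.open(url)
    url = usock.geturl()  # the URL after all redirects
    usock.close()
    return url, cj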
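For readers on Python 3, here is a rough sketch of the same idea. It assumes the renamed standard-library modules (urllib2 became urllib.request, cookielib became http.cookiejar). The timeout parameter and the open_with_cookies helper are my own hypothetical additions for illustration and are not part of the original function; the helper shows one possible way to reuse the returned cookie jar for a later request.

```python
import http.cookiejar
import urllib.request


def get_original_url(url, timeout=10):
    """Follow redirects from url and return (final_url, cookie_jar).

    timeout is an assumed extra parameter, not in the original function.
    """
    cj = http.cookiejar.CookieJar()
    # The cookie processor stores cookies received along the redirect chain.
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cj))
    opener.addheaders = [("User-agent", "Mozilla/5.0")]
    response = opener.open(url, timeout=timeout)
    try:
        return response.geturl(), cj
    finally:
        response.close()


def open_with_cookies(url, cj, timeout=10):
    """Hypothetical helper: fetch url, sending matching cookies stored in cj."""
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cj))
    opener.addheaders = [("User-agent", "Mozilla/5.0")]
    response = opener.open(url, timeout=timeout)
    try:
        return response.read()
    finally:
        response.close()


# Example usage (the URLs are placeholders):
# final_url, jar = get_original_url("http://example.com/some-redirect")
# page = open_with_cookies("http://example.com/other-page", jar)
```

Because the same CookieJar object is handed to the second opener, any cookies collected during the first request are sent back automatically when the domain and path match, so there is no need to build a Cookie header by hand.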
Comments
Could you please make it clearer how to use the CookieJar? A few days ago I coded a neat little script that gets a random token from a website and uses urllib2 to get the cookie string: the string is response.info()['set-cookie']. I tried to use this string by adding it to the headers (Cookie: the_string), but it doesn't work.
So to be clear, I have faked a user agent, logged in, and got the Set-Cookie header, but I have no clue how to use it. Now I need to retrieve some pages using that cookie.
If you write a post about this, I'll write about your blog on mine (if it's ok). Thanks anyway.
I shall try to write a post about how to use cookiejar soon (hopefully by this week), so stay tuned! :)