Get Original URL

I once ran into trouble while crawling some websites. Some of the URLs I had were not the original URLs; they redirected to other URLs. So I came up with a function that follows the redirects and returns the original URL, meaning the URL of the page that is finally retrieved. Here I share it with you:


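import cookielib
import urllib2


def get_original_url(url):
    """Take a URL and return the original URL (after any redirects),
    together with the cookie jar holding any cookies that were set.
    """
    cj = cookielib.CookieJar()
    opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
    # Send a browser-like User-agent instead of urllib2's default one.
    opener.addheaders = [('User-agent', 'Mozilla/5.0')]
    usock = opener.open(url)
    # geturl() gives the URL of the page actually retrieved, after redirects.
    url = usock.geturl()
    usock.close()
    return url, cj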
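
The function above is written for Python 2. In Python 3, urllib2 and cookielib were renamed to urllib.request and http.cookiejar, so a rough equivalent might look like the sketch below. The names get_final_url and open_with_cookies, and the timeout parameter, are my own choices for illustration and are not part of the original code. The second helper shows one possible way to reuse the returned cookie jar for later requests.

```python
import http.cookiejar
import urllib.request


def get_final_url(url, timeout=10):
    """Follow any redirects for `url` and return (final_url, cookie_jar).

    Hypothetical Python 3 port of get_original_url; the name and the
    timeout parameter are assumptions made for this sketch.
    """
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar))
    opener.addheaders = [("User-agent", "Mozilla/5.0")]
    with opener.open(url, timeout=timeout) as response:
        # geturl() is the URL of the resource actually retrieved.
        return response.geturl(), jar


def open_with_cookies(url, jar, timeout=10):
    """Fetch `url`, sending whatever cookies are already stored in `jar`.

    Hypothetical helper: building a new opener around the same jar lets
    the cookie processor attach the matching cookies automatically.
    """
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar))
    opener.addheaders = [("User-agent", "Mozilla/5.0")]
    with opener.open(url, timeout=timeout) as response:
        return response.read()
```

With this approach there is no need to copy the Set-Cookie string into a Cookie header by hand: as long as the same jar is passed to the cookie processor, the stored cookies are sent with later requests to the matching site.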


Please send me your comments on this piece of code.

Comments

aatiis said…
Hi,
Could you please make it clearer how to use the CookieJar? A few days ago I coded a neat little script that gets a random token from a website and uses urllib2 to get the cookie string: the string is response.info()['set-cookie']. I tried to use this string by adding it to the headers (Cookie: the_string), but it doesn't work.
So to be clear, I have faked a user-agent, logged in, and got the set-cookie header, but I have no clue how to use it. Now I need to retrieve some pages using that cookie.
If you write a post about this, I'll write about your blog on mine (if it's ok). Thanks anyway.
Tamim Shahriar said…
Thanks for your post.

I shall try to write a post about how to use cookiejar soon (hopefully by this week), so stay tuned! :)
Mabel @ Myo Myo said…
I tried your code with different proxies, but the server returns the same HTML source every time, even when different URLs are requested. I want to get this working, but I have not found a solution for the cookiejar. The website I want to crawl does not require authentication. Could you please post the code or any alternatives?
