Get Original URL

Once I got into trouble while crawling some websites. Some of the URLs I had weren't the original URLs; instead, they were redirecting to other URLs. So I came up with a function to get the original URL. Here I share it with you:


import cookielib
import urllib2


def get_original_url(url):
    """Take a URL and return the final URL after any redirects,
    along with the CookieJar holding whatever cookies were set.
    """
    cj = cookielib.CookieJar()
    # an opener that follows redirects and collects cookies in cj
    opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
    opener.addheaders = [('User-agent', 'Mozilla/5.0')]
    usock = opener.open(url)
    url = usock.geturl()  # the URL we actually ended up at
    usock.close()
    return url, cj
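
For example (the redirecting URL below is just a placeholder):

final_url, cookies = get_original_url('http://example.com/some-redirect')
print final_url              # the URL after all redirects
for cookie in cookies:       # any cookies the server set along the way
    print cookie.name, cookie.value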


Please send me your comments on this piece of code.

Comments

aatiis said…
Hi,
Could you please make it clearer how to use the CookieJar? I coded a neat little script a few days ago that gets a random token from a website and uses urllib2 to get the cookie string: the string is response.info()['set-cookie']. I tried to use this string by adding it to the headers (Cookie: the_string), but it doesn't work.
So to be clear: I have faked a user-agent, logged in, and got the set-cookie header, but I have no clue how to use it. Now I need to retrieve some pages using that cookie.
If you write a post about this, I'll write about your blog on mine (if it's ok). Thanks anyway.
Tamim Shahriar said…
Thanks for your comment.

I shall try to write a post about how to use cookiejar soon (hopefully by this week), so stay tuned! :)
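
In the meantime, here is a rough sketch of the usual approach: you don't have to copy the set-cookie string into the headers yourself. Keep using the same opener, and the CookieJar attached to it will store the cookies from the login response and resend them on later requests. (The login URL and form fields below are made-up placeholders.)

import urllib
import urllib2
import cookielib

cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
opener.addheaders = [('User-agent', 'Mozilla/5.0')]

# log in; the set-cookie headers from the response go into cj automatically
login_data = urllib.urlencode({'username': 'me', 'password': 'secret'})
opener.open('http://example.com/login', login_data)

# later requests through the same opener resend those cookies
page = opener.open('http://example.com/members-only').read()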
Mabel @ Myo Myo said…
I tried your code with different proxies, but the server responds with the same HTML source every time, no matter which URL is requested. I want to try this, but I have no solution for the cookiejar. The website I want to crawl doesn't require authentication. Could you post the code or any alternatives?
