Get Original URL

February 19, 2008

I once ran into trouble while crawling some websites. Some of the URLs I had weren't the original URLs; they redirected to other URLs. So I came up with a function to get the original URL. Here I share it with you:


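import cookielib
import urllib2


def get_original_url(url):
    """Take a URL and return the original URL along with the cookie jar
    (which holds any cookies the server set).
    """
    cj = cookielib.CookieJar()
    opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
    # Send a browser-like user agent, since some sites may reject the default one
    opener.addheaders = [('User-agent', 'Mozilla/5.0')]
    usock = opener.open(url)
    # geturl() gives the URL actually retrieved, after any redirects
    url = usock.geturl()
    usock.close()
    return url, cj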
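
For readers on Python 3, where urllib2 and cookielib became urllib.request and http.cookiejar, here is a rough sketch of the same idea. It is only an illustration, not a drop-in replacement: the helper names make_opener and fetch_with_cookies and the timeout parameter are my own assumptions and are not part of the function above. The second helper shows one possible way to reuse the returned cookie jar for later requests, instead of building a Cookie header by hand.

```python
import http.cookiejar
import urllib.request


def make_opener(cookie_jar):
    """Build an opener that stores cookies in, and sends them from, cookie_jar.

    (Hypothetical helper, introduced only for this sketch.)
    """
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(cookie_jar))
    opener.addheaders = [("User-agent", "Mozilla/5.0")]
    return opener


def get_original_url(url, timeout=10):
    """Follow any redirects and return (final_url, cookie_jar).

    The timeout argument is an assumption added for this sketch.
    """
    cj = http.cookiejar.CookieJar()
    with make_opener(cj).open(url, timeout=timeout) as response:
        # geturl() reports the URL that was actually retrieved
        return response.geturl(), cj


def fetch_with_cookies(url, cookie_jar, timeout=10):
    """Fetch url and return the body as bytes.

    Cookies already in cookie_jar that match the URL are sent with the
    request, and new cookies from the response are stored in the same jar.
    """
    with make_opener(cookie_jar).open(url, timeout=timeout) as response:
        return response.read()
```

A typical call would look like final_url, cj = get_original_url(some_url), followed by fetch_with_cookies(another_url, cj) for pages that need the same cookies. The cookie processor decides which cookies apply to each request, so there should be no need to copy the Set-Cookie string around manually.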

Please send me your comments on this piece of code.

Comments

aatiis said…

Hi,
Could you please make it clearer how to use the CookieJar? A few days ago I wrote a neat little script that gets a random token from a website and uses urllib2 to get the cookie string: the string is response.info()['set-cookie']. I tried to use this string by adding it to the headers (Cookie: the_string), but it doesn't work.
So to be clear, I have faked a user-agent, logged in, and got the Set-Cookie header, but I have no clue how to use it. Now I need to retrieve some pages using that cookie.
If you write a post about this, I'll write about your blog on mine (if it's ok). Thanks anyway.

May 11, 2008 at 2:14 PM

Tamim Shahriar said…

Thanks for your post.

I shall try to write a post about how to use cookiejar soon (hopefully by this week), so stay tuned! :)

May 11, 2008 at 9:02 PM

Mabel @ Myo Myo said…

I tried your code with different proxies, but the server returns the same HTML source every time, no matter which URL is requested. I want to try this, but I have no solution for the cookiejar. The website I want to crawl does not require authentication. Could you please post the code or any alternatives?

March 19, 2013 at 9:13 AM
