Accept-Encoding 'gzip' to make your crawler faster

July 19, 2008

If you want to make your web crawler / spider faster, one place to look is the time it spends fetching data from the Internet. Fortunately, many websites serve their content in compressed form if the request asks for it. If you look at the request headers Firefox sends (you can use Firebug to see the request / response headers), you will find that it always asks for gzip content. If the server supports gzip, it sends the data gzip-compressed; otherwise it sends plain HTML. Compressed data uses less bandwidth, so the program can scrape / harvest data faster.
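To get a rough feel for the bandwidth saving, here is a small sketch that gzips a made-up HTML page and compares the sizes. It is written for Python 3, and the sample page is invented for illustration, so real pages will shrink by different amounts.

```python
import gzip

# Hypothetical sample page: repetitive markup, similar in spirit to real HTML.
html = ("<html><body>" + "<p>Hello, crawler!</p>" * 200 + "</body></html>").encode("utf-8")

# Compress the page the way a gzip-enabled server would before sending it.
compressed = gzip.compress(html)
ratio = len(compressed) / len(html)

print("original bytes:  ", len(html))
print("compressed bytes:", len(compressed))
print("ratio: %.2f" % ratio)
```

Text-heavy pages usually compress well, while images and archives that are already compressed gain very little.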

So you can do the same from your Python code. Just add ('Accept-Encoding', 'gzip,deflate') to the request headers. Check the following code chunk:


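import urllib2

# fetch is just an illustrative name for the enclosing function
def fetch(url, referer, uagent):
    opener = urllib2.build_opener()
    # Ask the server for compressed content
    opener.addheaders = [('Referer', referer),
                         ('User-Agent', uagent),
                         ('Accept-Encoding', 'gzip,deflate')]
    usock = opener.open(url)
    url = usock.geturl()  # final URL after any redirects
    data = decode(usock)
    usock.close()
    return data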

Note the decode() function used in the code. Yes, you have to decode the content (if it's compressed).


import gzip
import StringIO
import zlib

def decode(page):
    # Always return the body as a string, decompressing it if needed
    encoding = page.info().get("Content-Encoding")
    content = page.read()
    if encoding == 'deflate':
        return zlib.decompress(content)
    if encoding in ('gzip', 'x-gzip'):
        return gzip.GzipFile('', 'rb', 9, StringIO.StringIO(content)).read()
    # Not compressed: return the raw content as-is
    return content
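The code above uses urllib2 and StringIO, which exist only in Python 2. Here is a rough Python 3 sketch of the same idea, assuming the standard urllib.request, gzip and zlib modules. The helper names build_request, decode_body and fetch are my own and not from any library. The fallback for raw deflate streams is an assumption based on reports that some servers send deflate data without the zlib wrapper.

```python
import gzip
import urllib.request
import zlib


def build_request(url, referer, uagent):
    """Build a request that advertises gzip/deflate support (hypothetical helper)."""
    return urllib.request.Request(url, headers={
        "Referer": referer,
        "User-Agent": uagent,
        "Accept-Encoding": "gzip,deflate",
    })


def decode_body(raw, encoding):
    """Return decompressed bytes, given the Content-Encoding header value."""
    if encoding in ("gzip", "x-gzip"):
        return gzip.decompress(raw)
    if encoding == "deflate":
        try:
            return zlib.decompress(raw)  # zlib-wrapped deflate
        except zlib.error:
            return zlib.decompress(raw, -zlib.MAX_WBITS)  # raw deflate stream
    return raw  # not compressed, or an encoding this sketch does not handle


def fetch(url, referer, uagent):
    """Fetch url and return the decoded body as bytes."""
    with urllib.request.urlopen(build_request(url, referer, uagent)) as resp:
        return decode_body(resp.read(), resp.headers.get("Content-Encoding"))
```

fetch() returns bytes, so call .decode() on the result with the page's character set (often 'utf-8') if you need a str.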

You can also have a look at this page from the book - Dive Into Python: http://diveintopython.org/http_web_services/gzip_compression.html

If you would like to buy a hard copy of this book, get it from here: Dive Into Python

Comments

Avi Dullu said…

AWESUM !!!! ....

this is a boost ...
thnx for d code :)

August 3, 2008 at 2:17 AM

Tamim Shahriar said…

Thanks for your comment.

August 3, 2008 at 9:09 AM

Andrey said…

in case encoding is not gzipped, some corrections.

encoding = page.info().get("Content-Encoding")
if encoding in ('gzip', 'x-gzip', 'deflate'):
    content = page.read()
    if encoding == 'deflate':
        data = StringIO.StringIO(zlib.decompress(content))
    else:
        data = gzip.GzipFile('', 'rb', 9, StringIO.StringIO(content))
    content = data.read()
else:
    content = page.read()
return content

May 16, 2009 at 10:15 AM

Anonymous said…

+ 1 for Andrey change

October 19, 2010 at 3:07 PM

Anonymous said…

Also you should import these libraries:

import gzip
import StringIO
import zlib

October 19, 2010 at 3:20 PM

shatu said…

Its awesome... maybe it'll work only for python2

for python3

fetch = opener.open(request)
data = gzip.decompress(fetch.read())
data = str(data,'utf-8')

this will work...cheers

July 18, 2011 at 4:03 PM

Unknown said…

How do I drop "Accept-Encoding" field from header?

August 5, 2013 at 1:22 PM
