Saturday, July 19, 2008

Accept-Encoding 'gzip' to make your crawler faster

If you are looking for a way to make your web crawler / spider faster, a good place to start is reducing the time it takes to fetch data from the Internet. Fortunately, many websites serve their content in compressed form if the client asks for it. If you look at the request headers Firefox sends (you can use Firebug to see the request / response headers), you will find that it always asks for gzip content. If the server supports gzip, it sends the data gzip-compressed; otherwise it sends plain HTML. Compressed data uses less bandwidth, so your program can scrape / harvest data faster.
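To get a feel for the savings, here is a small sketch (Python 3) that compresses a made-up HTML page with the standard gzip module and compares the sizes. The markup is purely hypothetical, and real pages will compress more or less depending on their content.

```python
import gzip

# Hypothetical page: repetitive markup, which is typical of HTML.
item = "<div class='item'><a href='/post'>A post title</a></div>"
html = ("<html><body>" + item * 200 + "</body></html>").encode("utf-8")

# Compress it the same way a gzip-enabled server would before sending.
compressed = gzip.compress(html)
ratio = len(compressed) / len(html)

print("raw bytes:     ", len(html))
print("gzip bytes:    ", len(compressed))
print("compressed/raw:", round(ratio, 3))
```

The fewer bytes travel over the wire, the less time the crawler spends waiting on each download.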

So you can do the same from your Python code. Just add ('Accept-Encoding', 'gzip,deflate') to the request headers. Check the following code chunk:

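import urllib2

# fetch() is an illustrative name; the chunk is the body of a function
def fetch(url, referer, uagent):
    opener = urllib2.build_opener()
    opener.addheaders = [('Referer', referer),
                         ('User-Agent', uagent),
                         ('Accept-Encoding', 'gzip,deflate')]
    usock = opener.open(url)
    url = usock.geturl()  # final URL after any redirects
    data = decode(usock)  # decode() is defined below
    usock.close()
    return data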
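The snippet above is for Python 2. In Python 3, urllib2 became urllib.request, so the same headers could be set roughly as in the sketch below. The name build_request, the example URL and the user agent string are assumptions made for illustration. Nothing is sent over the network here; the request object is only built and inspected.

```python
import urllib.request


def build_request(url, referer, uagent):
    """Return a Request that asks the server for compressed content."""
    headers = {
        "Referer": referer,
        "User-Agent": uagent,
        "Accept-Encoding": "gzip,deflate",
    }
    return urllib.request.Request(url, headers=headers)


# Hypothetical values, used only to show the resulting headers.
req = build_request("http://example.com/", "http://example.com/", "my-crawler/0.1")
print(req.header_items())
```

Passing req to urllib.request.urlopen() would then perform the actual fetch. The response body still has to be decompressed by hand.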

Note the decode() function called in the urllib2 snippet above. Yes, you have to decode the content yourself if the server sent it compressed, because urllib2 does not decompress it for you.

import gzip
import StringIO
import zlib

def decode(page):
    encoding = page.info().get("Content-Encoding")
    content = page.read()
    if encoding in ('gzip', 'x-gzip', 'deflate'):
        if encoding == 'deflate':
            data = StringIO.StringIO(zlib.decompress(content))
        else:
            data = gzip.GzipFile('', 'rb', 9, StringIO.StringIO(content))
        content = data.read()
    # always return the body as a string, compressed or not
    return content
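For Python 3, the same idea might look like the sketch below. It works on the raw bytes of the response instead of the response object, and the name decode_body is an assumption made for illustration. Some servers send 'deflate' as a raw stream without the zlib header, so the sketch falls back to raw decoding when the normal call fails.

```python
import gzip
import zlib


def decode_body(body, encoding):
    """Decompress response bytes according to the Content-Encoding value."""
    encoding = (encoding or "").strip().lower()
    if encoding in ("gzip", "x-gzip"):
        return gzip.decompress(body)
    if encoding == "deflate":
        try:
            # Standard case: deflate data wrapped in a zlib header.
            return zlib.decompress(body)
        except zlib.error:
            # Fallback: raw deflate stream with no zlib header.
            return zlib.decompress(body, -zlib.MAX_WBITS)
    # Not compressed (or an encoding we do not handle): return as-is.
    return body


# Self-check with made-up data instead of a real download.
sample = b"<html>hello crawler</html>" * 50
print(decode_body(gzip.compress(sample), "gzip") == sample)
print(decode_body(zlib.compress(sample), "deflate") == sample)
print(decode_body(sample, None) == sample)
```

With a real response object, the call would be along the lines of decode_body(response.read(), response.headers.get("Content-Encoding")).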


You can also have a look at this page from the book - Dive Into Python: http://diveintopython.org/http_web_services/gzip_compression.html

If you would like to buy a hard copy of this book, get it from here: Dive Into Python

7 comments:

avi.dullu said...

AWESUM !!!! ....

this is a boost ...
thnx for d code :)

subeen said...

Thanks for your comment.

Andrey said...

In case the content is not gzipped, here are some corrections:

encoding = page.info().get("Content-Encoding")
if encoding in ('gzip', 'x-gzip', 'deflate'):
    content = page.read()
    if encoding == 'deflate':
        data = StringIO.StringIO(zlib.decompress(content))
    else:
        data = gzip.GzipFile('', 'rb', 9, StringIO.StringIO(content))
    content = data.read()
else:
    content = page.read()
return content

andresgutgon said...

+ 1 for Andrey change

andresgutgon said...

Also, you should import these libraries:

import gzip
import StringIO
import zlib

shatu said...

It's awesome... but maybe it will work only for Python 2.

For Python 3:

fetch = opener.open(request)
data = gzip.decompress(fetch.read())
data = str(data,'utf-8')

this will work...cheers

Riyad Parvez said...

How do I drop "Accept-Encoding" field from header?