Accept-Encoding 'gzip' to make your crawler faster

If you are looking for a way to make your web crawler / spider work faster, you are probably looking to reduce the time it takes to fetch data from the Internet. Fortunately, many websites serve their content in compressed form if the request asks for it. If you look at the request headers Firefox sends (you can use Firebug to inspect the request / response headers), you will find that it always asks for gzip content. If the server supports gzip, it sends the data compressed; otherwise it sends plain HTML. Compressed data means less bandwidth, which makes your scraping / harvesting program faster.
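For example, the exchange looks roughly like this (example.com is just a placeholder host):

GET /page.html HTTP/1.1
Host: example.com
Accept-Encoding: gzip,deflate

HTTP/1.1 200 OK
Content-Encoding: gzip
Content-Type: text/html
(compressed body follows)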

So you can do the same from your Python code. Just add ('Accept-Encoding', 'gzip,deflate') to the request headers. Check the following code chunk:

opener = urllib2.build_opener()
# ask the server for compressed content
opener.addheaders = [('Referer', referer),
                     ('User-Agent', uagent),
                     ('Accept-Encoding', 'gzip,deflate')]
usock = opener.open(url)
url = usock.geturl()
data = decode(usock)   # decompress the response if needed
usock.close()
return data

Note the decode() function used in the code. Yes, you have to decode the content (if it's compressed).

def decode(page):
    encoding = page.info().get("Content-Encoding")
    if encoding in ('gzip', 'x-gzip', 'deflate'):
        # the response is compressed - decompress it before returning
        content = page.read()
        if encoding == 'deflate':
            data = StringIO.StringIO(zlib.decompress(content))
        else:
            data = gzip.GzipFile('', 'rb', 9, StringIO.StringIO(content))
        page = data.read()

    return page
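If you want to check whether the server really sent compressed data, a quick sketch (the URL below is just a placeholder) is to look at the Content-Encoding header of the response:

import urllib2

opener = urllib2.build_opener()
opener.addheaders = [('Accept-Encoding', 'gzip,deflate')]
usock = opener.open('http://example.com/')
# prints 'gzip' (or 'deflate') if the server compressed the response, None otherwise
print usock.info().get('Content-Encoding')
usock.close()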


You can also have a look at this page from the book - Dive Into Python: http://diveintopython.org/http_web_services/gzip_compression.html

If you would like to buy a hard copy of this book, get it from here: Dive Into Python

Comments

Avi Dullu said…
AWESUM !!!! ....

this is a boost ...
thnx for d code :)
Tamim Shahriar said…
Thanks for your comment.
Andrey said…
In case the encoding is not gzip, some corrections:

encoding = page.info().get("Content-Encoding")
if encoding in ('gzip', 'x-gzip', 'deflate'):
    content = page.read()
    if encoding == 'deflate':
        data = StringIO.StringIO(zlib.decompress(content))
    else:
        data = gzip.GzipFile('', 'rb', 9, StringIO.StringIO(content))
    content = data.read()
else:
    content = page.read()
return content
Anonymous said…
+ 1 for Andrey change
Anonymous said…
Also you should import these libraries:

import gzip
import StringIO
import zlib
shatu said…
It's awesome... but maybe it'll work only for Python 2.

For Python 3:

fetch = opener.open(request)
data = gzip.decompress(fetch.read())
data = str(data, 'utf-8')

this will work...cheers
Unknown said…
How do I drop the "Accept-Encoding" field from the header?
