Accept-Encoding 'gzip' to make your crawler faster
If you want to make your web crawler / spider faster, one of the first things to reduce is the time spent fetching data from the Internet. Fortunately, many websites will send their content in compressed form if the client asks for it. If you look at the request headers Firefox sends (you can use Firebug to see the request / response headers), you will find that it always asks for gzip content. If the server supports gzip, it sends the response compressed; otherwise it sends plain, uncompressed HTML. Compressed responses use less bandwidth, so your program can scrape / harvest data faster.
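To get a feel for the savings, here is a minimal Python 3 sketch that compresses a made-up HTML page with the standard gzip module and compares the sizes. The sample page is hypothetical and very repetitive, so it compresses far better than a typical real page would; treat the numbers as an illustration only.

```python
import gzip

# Hypothetical sample page: repetitive markup, invented for this illustration.
html = ("<html><body>" + "<p>Hello, crawler!</p>" * 500 + "</body></html>").encode("utf-8")

# This is roughly what a gzip-capable server does before sending the body.
compressed = gzip.compress(html)

print("raw bytes:       ", len(html))
print("compressed bytes:", len(compressed))

# Decompressing gives back exactly the original bytes.
assert gzip.decompress(compressed) == html
```

The fewer bytes that travel over the network, the less time each fetch takes, at the cost of a little CPU time for decompression on your side.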
You can do the same from your Python code. Just add
('Accept-Encoding', 'gzip,deflate')
to the request headers. Check the following code chunk:
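import urllib2

# The function name and signature are illustrative; the original snippet
# showed only the body.
def fetch(url, referer, uagent):
    opener = urllib2.build_opener()
    # Ask the server for compressed content
    opener.addheaders = [('Referer', referer),
                         ('User-Agent', uagent),
                         ('Accept-Encoding', 'gzip,deflate')]
    usock = opener.open(url)
    url = usock.geturl()  # final URL after any redirects
    data = decode(usock)
    usock.close()
    return data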
Note the decode() function used in the code. Yes, you have to decode the content (if it's compressed).
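import gzip
import StringIO
import zlib

def decode(page):
    # Decompress the response body according to its Content-Encoding header.
    encoding = page.info().get("Content-Encoding")
    if encoding in ('gzip', 'x-gzip', 'deflate'):
        content = page.read()
        if encoding == 'deflate':
            data = StringIO.StringIO(zlib.decompress(content))
        else:
            data = gzip.GzipFile('', 'rb', 9, StringIO.StringIO(content))
        page = data.read()
    # Note: if the response is not compressed, the response object itself
    # is returned, so the caller still has to call read() on it.
    return page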
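The code above targets Python 2 (urllib2, StringIO). For Python 3, a rough equivalent might look like the sketch below. The helper names build_request and decode_body, the example URL, and the user-agent string are all hypothetical, and the fallback to raw deflate is an extra precaution I am assuming is useful because some servers send deflate data without the zlib wrapper.

```python
import gzip
import urllib.request
import zlib

def build_request(url, referer, uagent):
    """Build a request that advertises gzip/deflate support (hypothetical helper)."""
    return urllib.request.Request(url, headers={
        'Referer': referer,
        'User-Agent': uagent,
        'Accept-Encoding': 'gzip,deflate',
    })

def decode_body(content, encoding):
    """Return the decompressed bytes for a given Content-Encoding value."""
    if encoding in ('gzip', 'x-gzip'):
        return gzip.decompress(content)
    if encoding == 'deflate':
        try:
            return zlib.decompress(content)  # zlib-wrapped deflate
        except zlib.error:
            return zlib.decompress(content, -zlib.MAX_WBITS)  # raw deflate
    # Not compressed (or an encoding this sketch does not handle): return as-is.
    return content

# Offline check with made-up data, so no network access is needed.
req = build_request('http://example.com/', 'http://example.com/', 'my-crawler/0.1')
body = b'<html>hello</html>'
assert decode_body(gzip.compress(body), 'gzip') == body
assert decode_body(zlib.compress(body), 'deflate') == body
assert decode_body(body, None) == body
```

With a real response you would do something like resp = urllib.request.urlopen(req) and then decode_body(resp.read(), resp.headers.get('Content-Encoding')). Unlike the Python 2 decode() above, this version always returns bytes, whether or not the server compressed the response.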
You can also have a look at this page from the book - Dive Into Python: http://diveintopython.org/http_web_services/gzip_compression.html
If you would like to buy a hard copy of this book, get it from here: Dive Into Python
Comments
this is a boost ...
thnx for d code :)
def decode(page):
    encoding = page.info().get("Content-Encoding")
    if encoding in ('gzip', 'x-gzip', 'deflate'):
        content = page.read()
        if encoding == 'deflate':
            data = StringIO.StringIO(zlib.decompress(content))
        else:
            data = gzip.GzipFile('', 'rb', 9, StringIO.StringIO(content))
        content = data.read()
    else:
        content = page.read()
    return content
import gzip
import StringIO
import zlib
for python3
fetch = opener.open(request)
data = gzip.decompress(fetch.read())
data = str(data,'utf-8')
this will work...cheers