Python UNICODE encode / decode error

Today I was trying to scrape a Spanish site and ran into trouble with some Spanish characters. My Python script had to parse some messages from that website and post them to Twitter, but for some reason the Spanish characters didn't show up in the Twitter status updates.

I kept running into the following error messages:
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 5778-5781: invalid data
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf3' in position 5778: ordinal not in range(128)
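
For reference, both errors can be reproduced in a few lines without touching the network. This is only a sketch: the sample word is made up, and it is written for Python 3 (not the Python 2.5 used in this post), where byte strings and text strings are separate types. The byte 0xF3 is how Latin-1 stores "ó"; in UTF-8 the same byte announces a four-byte sequence, which is consistent with the four-byte span 5778-5781 in the first message.

```python
# Hypothetical sample text; u"\xf3" is the character "ó".
text = u"Informaci\xf3n"
latin1_bytes = text.encode("latin_1")   # "ó" becomes the single byte 0xF3

# Decoding Latin-1 bytes as UTF-8 fails: 0xF3 starts a multi-byte
# sequence in UTF-8, but the byte after it is not a continuation byte.
try:
    latin1_bytes.decode("utf-8")
except UnicodeDecodeError as exc:
    decode_error = exc

# Encoding to ASCII fails: "ó" (code point 0xF3) is outside range(128).
try:
    text.encode("ascii")
except UnicodeEncodeError as exc:
    encode_error = exc

print(decode_error)
print(encode_error)
```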

After Googling for a while, I found this very useful link.

I used this code to get the content of the website:

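import codecs
import urllib2

url = ''  # put the URL here
usock = urllib2.urlopen(url)

# Wrap the socket in a reader that decodes the raw bytes as Latin-1.
Reader = codecs.getreader("latin_1")
fh = Reader(usock)
data = fh.read()  # data is now a unicode object
fh.close()
usock.close()

# Re-encode to a byte string. This keeps the page's own Latin-1 encoding;
# encode to "utf-8" instead if whatever consumes the data expects UTF-8.
data = data.encode("latin_1")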


At first I used utf-8 rather than latin_1, but then I got this error: "UnicodeDecodeError: 'utf8' codec can't decode bytes in position 5778-5781: invalid data". Looking at the HTML source, I found that the website uses the latin_1 character set, not utf-8.
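
Rather than reading the HTML source by eye, the declared charset can also be picked up in code. The following is a rough sketch, not what my script does: it is written for Python 3, the helper names declared_charset and to_utf8 and the sample page are made up for illustration, and the regular expression only covers the common forms of the meta tag.

```python
import re
from email.message import Message

# Matches declarations such as <meta charset="iso-8859-1"> or
# <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">.
META_CHARSET = re.compile(br'<meta[^>]*charset=["\']?([A-Za-z0-9_\-]+)', re.I)

def declared_charset(content_type, body, default="utf-8"):
    """Return the charset from the header, else from a <meta> tag, else default."""
    msg = Message()
    msg["Content-Type"] = content_type or "text/html"
    charset = msg.get_content_charset()   # lower-cased, or None if absent
    if charset:
        return charset
    match = META_CHARSET.search(body[:2048])   # only look near the top of the page
    if match:
        return match.group(1).decode("ascii").lower()
    return default

def to_utf8(content_type, body):
    """Decode body with its declared charset and re-encode it as UTF-8 bytes."""
    text = body.decode(declared_charset(content_type, body))
    return text.encode("utf-8")

# Hypothetical page served without a charset in the header; "ó" is byte 0xF3.
page = (b'<html><head><meta charset="iso-8859-1"></head>'
        b'<body>Informaci\xf3n</body></html>')
utf8_page = to_utf8("text/html", page)
```

The header is checked first and the meta tag second. If neither names a charset, the sketch falls back to utf-8, which is itself an assumption and may be wrong for a given site.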

Don't forget to check the codecs module. Btw, I am using Python 2.5.2. :)
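
As a small taste of the codecs module, here is a sketch that exercises the same getreader call without a network connection. It assumes Python 3, and the io.BytesIO object with made-up sample bytes is only a stand-in for the socket returned by urlopen.

```python
import codecs
import io

# Stand-in for the urlopen() socket: a byte stream encoded in Latin-1.
raw = io.BytesIO(b"Informaci\xf3n en espa\xf1ol\n")

# getreader() returns a StreamReader class; instantiating it wraps the
# byte stream so that read() hands back decoded text.
reader = codecs.getreader("latin_1")(raw)
text = reader.read()
reader.close()

# lookup() resolves aliases, so different spellings of a charset name
# end up at the same codec.
same_codec = codecs.lookup("latin_1").name == codecs.lookup("iso-8859-1").name

print(text)
print(same_codec)
```

The alias lookup is handy because sites spell the same charset in different ways (latin_1, latin-1, iso-8859-1 and so on).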

Comments

Andrei Savu said…
... and don't forget to check chardet

http://chardet.feedparser.org/
Tamim Shahriar said…
Thanks Andrei for the cool link.
wingi said…
You should encode and decode with the correct charsets - don't just try several charsets until it works (for you)! Your Spanish site will serve the correct encoding in the header or the HTML. And the Twitter API will accept utf-8.
Unknown said…
I am new to Python. I can't print Bengali in the console. Can you suggest any way or sample code?
