Python UNICODE encode / decode error
Today I was trying to scrape a Spanish site and ran into trouble with some Spanish characters. I had to parse some messages from that website and post them to Twitter using my Python script, but for some reason the Spanish characters didn't show up in the Twitter status updates.
I kept running into the following error messages:
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 5778-5781: invalid data
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf3' in position 5778: ordinal not in range(128)
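Both errors are easy to reproduce in isolation. The sketch below is written for Python 3 (this post uses Python 2.5.2, where the same mistakes surface with slightly different wording), and the sample word is made up for illustration, not taken from the scraped page.

```python
# Hypothetical sample: "canción" stored as latin_1, so ó is the single byte 0xF3.
raw = "canci\xf3n".encode("latin_1")

# Decoding latin_1 bytes as UTF-8 fails: 0xF3 looks like the first byte of a
# four-byte UTF-8 sequence, but the bytes that follow do not complete it.
try:
    raw.decode("utf-8")
    decode_error = None
except UnicodeDecodeError as exc:
    decode_error = exc

# Decoding with the right codec works, but encoding the result as ASCII
# fails because U+00F3 is outside range(128).
text = raw.decode("latin_1")
try:
    text.encode("ascii")
    encode_error = None
except UnicodeEncodeError as exc:
    encode_error = exc

print(type(decode_error).__name__, decode_error.encoding, decode_error.start)
print(type(encode_error).__name__, encode_error.encoding, encode_error.start)
```

In other words, the first error means the bytes were decoded with the wrong codec, and the second means text was implicitly or explicitly pushed through the ASCII codec on the way out.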
Then I Googled for some time and found this very useful link.
I used this code to get the content of the website:
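import codecs
import urllib2

url = '' # put the URL here
usock = urllib2.urlopen(url)

# Wrap the socket in a reader that decodes latin_1 bytes to unicode
Reader = codecs.getreader("latin_1")
fh = Reader(usock)
data = fh.read()  # data is now a unicode object
fh.close()
usock.close()

# Convert the unicode object back to a latin_1 byte string
data = data.encode("latin_1")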
I first used utf-8 encoding rather than latin_1, but then I got this error: "UnicodeDecodeError: 'utf8' codec can't decode bytes in position 5778-5781: invalid data". Looking at the HTML source, I found that the website uses the latin_1 character set, not utf-8.
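Instead of guessing, the declared charset can be read from the page before decoding. Here is a rough sketch for Python 3; the helper name sniff_charset, the regular expression, and the sample HTML are all my own assumptions for illustration. A real page may declare its encoding in the HTTP Content-Type header instead, or not declare it at all.

```python
import re

# Loose pattern for charset=... inside a <meta> tag; not a full HTML parser.
META_CHARSET = re.compile(rb"<meta[^>]+charset=[\"']?([A-Za-z0-9_\-]+)", re.IGNORECASE)

def sniff_charset(raw, default="utf-8"):
    """Return the charset declared in the first 2 KB of raw HTML bytes, or default."""
    match = META_CHARSET.search(raw[:2048])
    if match is None:
        return default
    return match.group(1).decode("ascii").lower()

# Hypothetical page, encoded as latin_1 like the site in this post.
page = (
    '<html><head><meta http-equiv="Content-Type" '
    'content="text/html; charset=ISO-8859-1"></head>'
    '<body>canci\xf3n</body></html>'
).encode("latin_1")

charset = sniff_charset(page)   # the name declared by the page itself
text = page.decode(charset)     # decode with that codec instead of a guess
print(charset, text)
```

This only trusts what the page says about itself, so it will still go wrong when the declaration is missing or incorrect.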
Don't forget to check the codecs module. Btw, I am using Python 2.5.2. :)
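For anyone who wants to try the codecs module without hitting a real website, here is a small Python 3 sketch. The in-memory stream stands in for the socket returned by urlopen(), the sample text is invented, and the final UTF-8 step assumes that whatever receives the text expects UTF-8.

```python
import codecs
import io

# In-memory stand-in for a network socket; the bytes are made-up latin_1 data.
fake_socket = io.BytesIO("Informaci\xf3n\nEspa\xf1a\n".encode("latin_1"))

# codecs.getreader returns a StreamReader class that decodes while reading.
Reader = codecs.getreader("latin_1")
reader = Reader(fake_socket)
lines = reader.read().splitlines()
reader.close()

# Re-encode as UTF-8 for a consumer that expects UTF-8 (an assumption here).
utf8_bytes = "\n".join(lines).encode("utf-8")
print(lines)
print(utf8_bytes)
```

The point of the reader is that decoding happens once, at the boundary where bytes come in, and the rest of the script works with text only.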
Comments
http://chardet.feedparser.org/