Python UNICODE encode / decode error

Today I was trying to scrape a Spanish site and ran into trouble with some Spanish characters. I had to parse some messages from the website and post them to Twitter using my Python script, but for some reason the Spanish characters didn't show up in the Twitter status updates.

I kept running into the following error messages:
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 5778-5781: invalid data
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf3' in position 5778: ordinal not in range(128)
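
To make the two errors concrete, here is a minimal sketch that reproduces both on a single made-up word. The sample bytes are my own illustration, not data from the site, and the `b"..."` literal assumes Python 2.6 or newer (it also runs on Python 3).

```python
# -*- coding: utf-8 -*-
# Hypothetical sample: "informacion" with an accented o, stored as Latin-1 bytes.
raw = b"informaci\xf3n"  # 0xF3 is the accented o in Latin-1

# 1) Decoding Latin-1 bytes as UTF-8 fails, because 0xF3 followed by "n"
#    is not a valid UTF-8 byte sequence.
try:
    raw.decode("utf-8")
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True

# 2) Decoding with the codec the bytes were written in gives unicode text.
text = raw.decode("latin_1")

# 3) Encoding that text as ASCII fails, because the accented character
#    (code point 0xF3) is outside range(128).
try:
    text.encode("ascii")
    encode_failed = False
except UnicodeEncodeError:
    encode_failed = True

# 4) Encoding as UTF-8 succeeds; the accented o becomes two bytes.
utf8_bytes = text.encode("utf-8")
```

So the decode error means "I guessed the wrong input charset", and the encode error means "the output codec cannot represent this character".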

Then I Googled for some time and found this very useful link.

I used this code to get the content of the website:

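import codecs
import urllib2

url = ''  # put the URL here
usock = urllib2.urlopen(url)

# Wrap the response so that reads decode the page's Latin-1 bytes to unicode
Reader = codecs.getreader("latin_1")
fh = Reader(usock)
data = fh.read()
fh.close()
usock.close()

# Encode the unicode text back to a Latin-1 byte string
# (use "utf-8" here instead if the receiving side expects UTF-8)
data = data.encode("latin_1")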


At first I used utf-8 rather than latin_1, but I got this error: "UnicodeDecodeError: 'utf8' codec can't decode bytes in position 5778-5781: invalid data". Then I found (from the HTML source) that the website uses the latin_1 character set, not utf-8.
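
Instead of guessing the charset, it can be read from the Content-Type header or from the HTML itself. The following is only a rough sketch: `find_charset` and `to_utf8` are hypothetical helper names of mine, the regular expression is a simple approximation rather than a real HTML parser, and falling back to latin_1 is an assumption that suits this particular site.

```python
import re

def find_charset(content_type, html_bytes, default="latin_1"):
    """Return the charset named in the Content-Type header value,
    else the one named near the top of the HTML, else the default."""
    match = re.search(r"charset=([\w-]+)", content_type or "", re.I)
    if match:
        return match.group(1).lower()
    # The <meta> tag normally sits in <head>, so only scan the first 2048 bytes.
    head = html_bytes[:2048].decode("ascii", "ignore")
    match = re.search(r"charset=[\"']?([\w-]+)", head, re.I)
    if match:
        return match.group(1).lower()
    return default

def to_utf8(content_type, html_bytes):
    """Decode the page with its declared charset and re-encode it as UTF-8."""
    charset = find_charset(content_type, html_bytes)
    return html_bytes.decode(charset).encode("utf-8")

# Made-up page for illustration; 0xF1 is the Latin-1 byte for the n with tilde.
page = (b'<html><head><meta http-equiv="Content-Type" '
        b'content="text/html; charset=ISO-8859-1"></head>'
        b'<body>Espa\xf1a</body></html>')
utf8_page = to_utf8("text/html", page)
```

With urllib2, the header value would come from something like `usock.info().getheader("Content-Type")`. If the page declares a charset name that Python does not know, `decode` raises a LookupError, which this sketch does not handle.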

Don't forget to check the codecs module. Btw, I am using Python 2.5.2. :)

Comments

Andrei Savu said…
... and don't forget to check chardet

http://chardet.feedparser.org/
Tamim Shahriar said…
Thanks Andrei for the cool link.
wingi said…
You should encode and decode with the correct charsets - don't just try several charsets until it works (for you)! Your Spanish site will serve the correct encoding in the header or the HTML. And the Twitter API will accept utf-8.
Unknown said…
I am new to Python. I can't print Bengali in the console. Can you suggest any way or sample code?
