Python UNICODE encode / decode error

November 12, 2009

Today I was trying to scrape a Spanish site and got into trouble with some Spanish characters. I had to parse some messages from that website and post them to Twitter using my Python script, but for some reason the Spanish characters didn't show up in the Twitter status updates.

I kept running into the following error messages:
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 5778-5781: invalid data
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf3' in position 5778: ordinal not in range(128)
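
For context, both errors can be reproduced in a few lines. This is only a minimal sketch written for Python 3 (the script in this post ran on Python 2.5.2, where the messages are worded slightly differently), and the sample word is made up for illustration.

```python
# Hypothetical sample data: the word "cancion" with an accented o (U+00F3),
# encoded as Latin-1 bytes.
word = u"canci\xf3n"
latin1_bytes = word.encode("latin_1")  # b'canci\xf3n'

# 1. Decoding Latin-1 bytes as UTF-8 fails: byte 0xF3 announces a
#    multi-byte UTF-8 sequence, but the next byte ('n') is not a valid
#    continuation byte.
try:
    latin1_bytes.decode("utf-8")
    decode_error = None
except UnicodeDecodeError as exc:
    decode_error = exc

# 2. Encoding the same text as ASCII fails: ASCII only covers code
#    points 0-127, and U+00F3 is outside that range.
try:
    word.encode("ascii")
    encode_error = None
except UnicodeEncodeError as exc:
    encode_error = exc

print(type(decode_error).__name__, "at position", decode_error.start)
print(type(encode_error).__name__, "at position", encode_error.start)
```

In both cases the position in the message points at the offending byte or character, which helps when looking for it in the page source.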

Then I Googled for a while and found this very useful link.

I used this code to get the content of the website:


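import codecs
import urllib2

url = '' # put the URL here
usock = urllib2.urlopen(url)
# Wrap the response in a reader that decodes latin_1 bytes to unicode
Reader = codecs.getreader("latin_1")
fh = Reader(usock)
data = fh.read()  # data is now a unicode object
fh.close()
usock.close()
# Encode the unicode text back into a latin_1 byte string
data = data.encode("latin_1")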
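
To see what the reader does without touching the network, the socket can be swapped for an in-memory byte stream. This is a rough sketch for Python 3, assuming io.BytesIO as a stand-in for the urllib2 response and using made-up sample bytes:

```python
import codecs
import io

# Hypothetical stand-in for the HTTP response: Latin-1 encoded bytes.
raw = io.BytesIO(b"Espa\xf1a, canci\xf3n")

# codecs.getreader returns a StreamReader class for the named codec;
# wrapping a byte stream with it makes read() return decoded text.
Reader = codecs.getreader("latin_1")
fh = Reader(raw)
text = fh.read()
fh.close()  # also closes the wrapped stream

# Encoding with the same codec gives back the original bytes.
round_trip = text.encode("latin_1")

# Encoding as UTF-8 instead yields bytes a UTF-8 consumer can read.
utf8_bytes = text.encode("utf-8")

print(repr(round_trip))
print(repr(utf8_bytes))
```

The choice of codec in the final encode step depends on where the data is going; if the receiving side expects UTF-8, the last variant is probably the one to use.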

At first I used utf-8 rather than latin_1, but I got this error: "UnicodeDecodeError: 'utf8' codec can't decode bytes in position 5778-5781: invalid data". Then I found (from the HTML source) that the website uses the latin_1 character set, not utf-8.
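
Checking the HTML source by hand can also be done in code. Below is a small sketch for Python 3; charset_from_html is a hypothetical helper name, the page content is made up, and the regular expression is a simplification, not a full HTML parser. A server may also declare the charset in its Content-Type HTTP header, which this sketch does not look at.

```python
import re

def charset_from_html(raw_bytes, default="utf-8"):
    """Return the charset named in the page source, or `default` if none.

    Hypothetical helper: it only searches the first 2048 bytes with a
    simple regular expression.
    """
    match = re.search(br'''charset\s*=\s*["']?([\w-]+)''',
                      raw_bytes[:2048], re.IGNORECASE)
    if match:
        return match.group(1).decode("ascii")
    return default

# Made-up page, encoded as Latin-1 (ISO-8859-1).
page = (u'<html><head><meta http-equiv="Content-Type" '
        u'content="text/html; charset=ISO-8859-1"></head>'
        u'<body>canci\xf3n</body></html>').encode("latin_1")

charset = charset_from_html(page)
text = page.decode(charset)  # decode with the declared charset

print(charset)
```

Decoding with the declared charset avoids guessing between utf-8 and latin_1.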

Don't forget to check the codecs module. Btw, I am using Python 2.5.2. :)

Comments

Andrei Savu said…

... and don't forget to check chardet

http://chardet.feedparser.org/

November 12, 2009 at 2:03 AM

Tamim Shahriar said…

Thanks Andrei for the cool link.

November 12, 2009 at 2:06 AM

wingi said…

You should encode and decode with the correct charsets - don't try several charsets until it works (for you)! Your Spanish site will serve the correct encoding in the header or HTML. And the Twitter API will accept utf-8.

April 5, 2010 at 2:07 AM

Unknown said…

I am new to Python. I can't print Bengali in the console. Can you suggest any way or sample code?

August 1, 2014 at 8:24 AM
