UnicodeDecodeError in CSV writer in Python

I ran into another error with the UnicodeWriter class. This time it's
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 21: ordinal not in range(128).
More details:

Traceback (most recent call last):
  File "cpx_parser.py", line 284, in <module>
    main()
  File "cpx_parser.py", line 278, in main
    writer.writerow(csv_li)
  File "cpx_parser.py", line 29, in writerow
    self.writer.writerow([s.encode("utf-8") for s in row])
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 21: ordinal not in range(128)


So I had to decode s using UTF-8 and then convert it to unicode, and that solved the problem. Here is what I did:
self.writer.writerow([unicode(s.decode("utf-8")).encode("utf-8") for s in row])

To learn more about Unicode, you can check this link.

Comments

Anonymous said…
I think you'll find that the original version works only for Unicode strings and your new version works only for UTF-8 encoded byte strings.

It looks like UnicodeWriter expects to be passed Unicode strings, which it will then encode to UTF-8 byte strings. When you instead pass it a byte string, Python has to convert it to Unicode before it can be encoded, which it does by decoding with the default codec, which is ASCII; since your string apparently contains non-ASCII characters, you get this exception.

This is all much less confusing in Python 3, where strings are always Unicode and the old string type has been renamed to "bytes"; the nonsensical bytes.encode and unicode.decode methods have been removed, leaving bytes.decode and str.encode.
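To illustrate the Python 3 behaviour described above, here is a small sketch (not from the original post; the sample string is just an example):

```python
# Python 3: str is always Unicode; bytes is a separate type.
s = "räksmörgås"           # str (Unicode text)
b = s.encode("utf-8")      # bytes (UTF-8 encoded)

# Round-tripping works in one direction only:
assert b.decode("utf-8") == s

# The asymmetric methods are gone: bytes has no .encode,
# and str has no .decode, so the implicit ASCII-decode
# trap from Python 2 cannot happen.
assert not hasattr(b, "encode")
assert not hasattr(s, "decode")
```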

Anyway, to give UnicodeWriter what it expects, try changing the code in your previous post to convert things to Unicode instead of bytes; i.e. change this:

row = [str(item) for item in row]

to this:

row = [unicode(item) for item in row]

The solution in this post (which, as I said, solves the problem for byte strings but creates a new problem with unicode strings) is, as far as I can tell, a convoluted no-op:

>>> s = u"räksmörgås".encode("utf-8")
>>> unicode(s.decode("utf-8")).encode("utf-8") == s
True

In other words, it assumes that the input is UTF-8, decodes it and then encodes it again as UTF-8, so it should be equivalent to just using the original row as-is:

self.writer.writerow(row)

/Henrik
Anonymous said…
Hm, maybe I didn't fully think that through. What I suggested would still break if you passed it non-ASCII byte strings.

So...

If you have Unicode strings, use UnicodeWriter.

If you have pre-encoded byte strings, use csv.writer.

If you have a mixture of Unicode strings, UTF-8 encoded byte strings and other data types, this should work (with UnicodeWriter):

row = [item.decode("utf-8") if isinstance(item, str) else unicode(item) for item in row]

...or with csv.writer:

row = [item.encode("utf-8") if isinstance(item, unicode) else str(item) for item in row]

Continuing my theme of hyping Python 3, the issue appears to have been solved there.
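For reference, in Python 3 the csv module accepts Unicode strings directly and encoding is handled by the file object, so no UnicodeWriter wrapper is needed. A minimal sketch (assuming Python 3; StringIO stands in for a real file, which you would open with open(path, "w", newline="", encoding="utf-8")):

```python
import csv
import io

rows = [["räksmörgås", "smörgåsbord"]]

# csv.writer works with str (Unicode) directly in Python 3;
# the encoding step belongs to the file layer, not the writer.
buf = io.StringIO()
writer = csv.writer(buf)
for row in rows:
    writer.writerow(row)

assert "räksmörgås" in buf.getvalue()
```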

/Henrik
