Python and Unicode: How everything should be Unicode

No, not every string "should be Unicode". Within your Python code, you know whether a string literal needs to be Unicode or not, so it makes no sense to turn every string literal into a Unicode literal.

But there are cases where you should use Unicode. For example, if you accept arbitrary text input, use Unicode for it. Sooner or later a non-American will use your program, and they will want to wrîte têxt ås they are üsed tö. You will run into problems then unless your input and output happen to use the same encoding, which you can't be sure of.
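
A minimal sketch of what that means in practice (the UTF-8 encoding here is an assumption about what the input source delivers; substitute whatever yours really uses):

    data = 'wr\xc3\xaete t\xc3\xaaxt'   # raw bytes, as they arrive
    text = data.decode('utf-8')         # decode once, at the boundary
    assert isinstance(text, unicode)    # work with unicode internally
    print text.encode('utf-8')          # encode only on the way out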

So in short, no, strings shouldn't be Unicode. Text should be. But YMMV.

Specifically:

  1. No need to use Unicode here. You know whether that string is ASCII or not.

  2. Depends on whether you need to mix those strings with Unicode strings or not.

  3. Both ways work. But do not encode/decode "when required": decode as soon as possible and encode as late as possible. Using codecs works well (or io, available since Python 2.6); see the sketch after this list.

  4. Yeah.
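
To illustrate point 3, here's a minimal sketch of the decode-early/encode-late pattern using io (the file names and the UTF-8 encoding are made up for the example):

    import io
    # io.open (available since Python 2.6) takes an encoding argument
    # and hands you unicode objects directly.
    with io.open('input.txt', 'r', encoding='utf-8') as f:
        text = f.read()          # decoded to unicode on the way in
    with io.open('output.txt', 'w', encoding='utf-8') as f:
        f.write(text)            # encoded back to bytes on the way out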


The "always use Unicode" suggestion is primarily to make the transition to Python 3 easier. If you have a lot of non-Unicode string access in your code, it'll take more work to port it.

Also, you shouldn't have to decide on a case-by-case basis whether a string should be stored as Unicode or not. You shouldn't have to change the types of your strings and their very syntax just because you changed their contents, either.

It's also easy to use the wrong string type, leading to code that mostly works, or that works on Linux but not on Windows, or in one locale but not another. For example, for c in "漢字" in a UTF-8 locale will iterate over each UTF-8 byte (all six of them), not over each character; whether that breaks things depends on what you do with them.
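
Concretely, in a UTF-8-encoded source file:

    # -*- coding: utf-8 -*-
    for c in "漢字":     # byte string: yields the 6 UTF-8 bytes
        print repr(c)    # '\xe6', '\xbc', '\xa2', ...
    for c in u"漢字":    # unicode string: yields the 2 characters
        print repr(c)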

In principle, nothing should break if you use Unicode strings, but things may break if you use regular strings when you shouldn't.

In practice, however, it's a pain to use Unicode strings everywhere in Python 2. codecs.open doesn't pick up the locale's encoding automatically; this fails:

codecs.open("blar.txt", "w").write(u"漢字")  # raises UnicodeEncodeError (no encoding given, ASCII assumed)

The real answer is:

import locale, codecs
lang, encoding = locale.getdefaultlocale()  # ask the locale for its encoding
codecs.open("blar.txt", "w", encoding).write(u"漢字")

... which is cumbersome, forcing people to write helper functions just to open files. codecs.open should use the locale's encoding automatically when one isn't specified; Python's failure to make such a simple operation convenient is one of the reasons people generally don't use Unicode everywhere.
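
The kind of helper people end up writing looks something like this (open_text is a made-up name, and the UTF-8 fallback for when the locale reports no encoding is an arbitrary choice):

    import locale, codecs

    def open_text(filename, mode='r'):
        # Use the locale's encoding when the caller doesn't give one.
        _, encoding = locale.getdefaultlocale()
        return codecs.open(filename, mode, encoding or 'utf-8')

    open_text("blar.txt", "w").write(u"漢字")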

Finally, note that Unicode strings are even more critical on Windows in some cases. For example, if you're in a Western locale and you have a file named "漢字", you must use a Unicode string to access it, e.g. os.stat(u"漢字"). It's impossible to access it with a non-Unicode string; it just won't see the file.
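
A short illustration (this assumes the file actually exists; on Windows, passing unicode makes Python use the wide-character filesystem API):

    import os
    os.stat(u"漢字")    # finds the file
    os.listdir(u".")    # also returns the names as unicode strings
    # os.stat("漢字")   # byte-string path: may fail to see the file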

So, in principle I'd say the Unicode string recommendation is reasonable, but with the caveat that I don't generally even follow it myself.


IMHO (my simple rules):

  1. Should I do print u'Some text' or just print 'Text'?

  2. Everything should be Unicode. Does this mean that if, say, I have a tuple t = ('First', 'Second'), it should be t = (u'First', u'Second')?

Well, I use unicode literals only when a string contains some character outside the ASCII range:

   print 'New York', u'São Paulo'
   t = ('New York', u'São Paulo')

  3. When reading/writing a file, should I use the codecs module? Or should I just use the standard way of reading/writing and encode or decode where required?

If you expect unicode text, use codecs.
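
For example (the file name and the encoding are assumptions):

    import codecs
    f = codecs.open('notes.txt', 'r', encoding='utf-8')
    text = f.read()    # already unicode; no manual decode needed
    f.close()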

  4. If I get a string from, say, raw_input(), should I convert it to Unicode also?

Only if you expect unicode text that may get transferred to another system with a different default encoding (including databases).
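
A sketch of decoding terminal input right away (sys.stdin.encoding can be None when input is piped, hence the locale fallback):

    import sys, locale
    line = raw_input('Name: ')    # a byte string in Python 2
    encoding = sys.stdin.encoding or locale.getpreferredencoding()
    name = line.decode(encoding)  # unicode from here on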

EDITED (about mixing unicode and byte strings):

>>> print 'New York', 'to', u'São Paulo'
New York to São Paulo
>>> print 'New York' + ' to ' + u'São Paulo'
New York to São Paulo
>>> print "Côte d'Azur" + ' to ' + u'São Paulo'
Traceback (most recent call last):
  File "<interactive input>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 1: 
     ordinal not in range(128)
>>> print "Côte d'Azur".decode('utf-8') + ' to ' + u'São Paulo'
Côte d'Azur to São Paulo

So if you mix a byte string containing UTF-8 (or other non-ASCII) characters with unicode text without explicit conversion, you will have trouble, because the default codec assumes ASCII. Mixing pure-ASCII byte strings with unicode, on the other hand, is safe. If you follow the rule of writing every string containing non-ASCII characters as a unicode literal, you should be OK.

DISCLAIMER: I live in Brazil, where people speak Portuguese, a language with lots of non-ASCII characters. My default encoding is always set to 'utf-8'. Your mileage may vary on English/ASCII systems.
