Set encoding in Python 3 CGI scripts

Answering this for late-comers because I don't think that the posted answers get to the root of the problem, which is the lack of locale environment variables in a CGI context. I'm using Python 3.2.

  1. open() opens file objects in text (string) or binary (bytes) mode for reading and/or writing; in text mode the encoding used to encode strings written to the file, and decode bytes read from the file, may be specified in the call; if it isn't then it is determined by locale.getpreferredencoding(), which on linux uses the encoding from your locale environment settings, which is normally utf-8 (from e.g. LANG=en_US.UTF-8)

    >>> f = open('foo', 'w')         # open file for writing in text mode
    >>> f.encoding
    'UTF-8'                          # encoding is from the environment
    >>> f.write('€')                 # write a Unicode string
    1
    >>> f.close()
    >>> exit()
    user@host:~$ hd foo
    00000000  e2 82 ac      |...|    # data is UTF-8 encoded
    
  2. sys.stdout is in fact a file opened for writing in text mode with an encoding based on locale.getpreferredencoding(); you can write strings to it just fine and they'll be encoded to bytes based on sys.stdout's encoding; print() by default writes to sys.stdout - print() itself has no encoding, rather it's the file it writes to that has an encoding;

    >>> sys.stdout.encoding
    'UTF-8'                          # encoding is from the environment
    >>> exit()
    user@host:~$ python3 -c 'print("€")' > foo
    user@host:~$ hd foo
    00000000  e2 82 ac 0a   |....|   # data is UTF-8 encoded; \n is from print()
    

    ; you cannot write bytes to sys.stdout - use sys.stdout.buffer.write() for that; if you try to write bytes to sys.stdout using sys.stdout.write() then it will return an error, and if you try using print() then print() will simply turn the bytes object into a string object and an escape sequence like \xff will be treated as the four characters \, x, f, f

    user@host:~$ python3 -c 'print(b"\xe2\xf82\xac")' > foo
    user@host:~$ hd foo
    00000000  62 27 5c 78 65 32 5c 78  66 38 32 5c 78 61 63 27  |b'\xe2\xf82\xac'|
    00000010  0a                                                |.|
    
  3. in a CGI script you need to write to sys.stdout and you can use print() to do it; but a CGI script process in Apache has no locale environment settings - they are not part of the CGI specification; therefore the sys.stdout encoding defaults to ANSI_X3.4-1968 - in other words, ASCII; if you try to print() a string that contain non-ASCII characters to sys.stdout you'll get "UnicodeEncodeError: 'ascii' codec can't encode character...: ordinal not in range(128)"

  4. a simple solution is to pass the Apache process's LANG environment variable through to the CGI script using Apache's mod_env PassEnv command in the server or virtual host configuration: PassEnv LANG; on Debian/Ubuntu make sure that in /etc/apache2/envvars you have uncommented the line ". /etc/default/locale" so that Apache runs with the system default locale and not the C (Posix) locale (which is also ASCII encoding); the following CGI script should run without errors in Python 3.2:

    #!/usr/bin/env python3
    import sys
    print('Content-Type: text/html; charset=utf-8')
    print()
    print('<html><body><pre>' + sys.stdout.encoding + '</pre>h€lló wörld<body></html>')
    

          


I solved my problem with the following code:

import locale                                  # Ensures that subsequent open()s 
locale.getpreferredencoding = lambda: 'UTF-8'  # are UTF-8 encoded.

import sys                                     
sys.stdin = open('/dev/stdin', 'r')       # Re-open standard files in UTF-8 
sys.stdout = open('/dev/stdout', 'w')     # mode.
sys.stderr = open('/dev/stderr', 'w') 

This solution is not pretty, but it seems to work for the time being. I actually chose Python 3 over the more commonplace v. 2.6 as my development platform due to the advertised good Unicode-handling, but the cgi package seems to ruin some of that simpleness.

I'm led to believe that the /dev/std* files may not exist on older systems that do not have a procfs. They are supported on recent Linuxes, however.


Summarizing @cercatrova 's answer:

  • Add PassEnv LANG line to the end of your /etc/apache2/apache2.conf or .htaccess.
  • Uncomment . /etc/default/locale line in /etc/apache2/envvars.
  • Make sure line similar to LANG="en_US.UTF-8" is present in /etc/default/locale.
  • sudo service apache2 restart

You shouldn't read your IO streams as strings for CGI/WSGI; they aren't Unicode strings, they're explicitly byte sequences.

(Consider that Content-Length is measured in bytes and not characters; imagine trying to read a multipart/form-data binary file upload submission crunched into UTF-8-decoded strings, or return a binary file download...)

So instead use sys.stdin.buffer and sys.stdout.buffer to get the raw byte streams for stdio, and read/write binary with them. It is up to the form-reading layer to convert those bytes into Unicode string parameters where appropriate using whichever encoding your web page has.

Unfortunately the standard library CGI and WSGI interfaces don't get this right in Python 3.1: the relevant modules were crudely converted from the Python 2 originals using 2to3 and consequently there are a number of bugs that will end up in UnicodeError.

The first version of Python 3 that is usable for web applications is 3.2. Using 3.0/3.1 is pretty much a waste of time. It took a lamentably long time to get this sorted out and PEP3333 passed.