Can't open Unicode URL with Python

This works for me:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# The encoding declaration above must be on line 1 or 2, see: http://www.python.org/dev/peps/pep-0263/

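import urllib  # Python 2; in Python 3, urlopen lives in urllib.request
url = u'http://example.com/índice.html'
# Encode the Unicode URL to a UTF-8 byte string before opening it
content = urllib.urlopen(url.encode("UTF-8")).read()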

Encoding the URL as UTF-8 should have worked. I wonder whether your source file is properly encoded, and whether the interpreter knows it. If your Python source file is saved as UTF-8, for example, then you should have

# coding=UTF-8

as the first or second line.

import urllib
url = u'http://mydomain.es/índice.html'
content = urllib.urlopen(url.encode('utf-8')).read()

works for me.

Edit: also, be aware that typing non-ASCII text into an interactive Python session (whether through IDLE or a console) is fraught with encoding-related difficulty. In those cases, you should use Unicode escape sequences inside the literal (like \u00ED in your case).
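
For what it's worth, here is a minimal sketch of the escaped form, reusing the example URL from above. The literal is pure ASCII, so the console's encoding shouldn't matter; the variable names are just for illustration.

```python
# \u00ed is the code point for "í" (lowercase i with acute accent);
# the literal itself contains only ASCII characters.
url = u'http://example.com/\u00edndice.html'

# Encoding to UTF-8 turns that single character into two bytes, 0xC3 0xAD.
encoded = url.encode('utf-8')
```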


Per the applicable standard, RFC 1738, URLs can only contain ASCII characters. Good explanation here, and I quote:

"...Only alphanumerics [0-9a-zA-Z], the special characters "$-_.+!*'()," [not including the quotes - ed], and reserved characters used for their reserved purposes may be used unencoded within a URL."

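As the URLs I've given explain, this probably means you'll have to replace that "lowercase i with acute accent" with '%ED'. That is the percent-encoded form of its Latin-1 byte; if the server expects UTF-8 instead, the character becomes '%C3%AD'.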
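
Rather than substituting by hand, the standard library can do the percent-encoding. This is only a sketch, assuming Python 3 (where quote lives in urllib.parse; in Python 2 it is urllib.quote and takes a byte string). The helper name percent_encode_url is made up for this example, and it only touches the path, not the host or query string.

```python
from urllib.parse import quote, urlsplit, urlunsplit

def percent_encode_url(url, encoding="utf-8"):
    """Hypothetical helper: percent-encode non-ASCII characters in the URL path."""
    parts = urlsplit(url)
    # quote() leaves "/" alone by default and escapes everything non-ASCII
    path = quote(parts.path, encoding=encoding)
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, parts.fragment))

# Which encoding is right depends on what the server expects.
utf8_url = percent_encode_url(u"http://example.com/\u00edndice.html")
latin1_url = percent_encode_url(u"http://example.com/\u00edndice.html", encoding="latin-1")
```

With UTF-8 the accented character comes out as %C3%AD, and with Latin-1 as %ED, so if one form gives a 404 it may be worth trying the other.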
