How to completely sanitize a string of illegal characters in python?

You can pass, "ignore" to skip invalid characters in .encode/.decode like "ILLEGAL".decode("utf8","ignore")

>>> "ILLEGA\xa0L".decode("utf8")
...
UnicodeDecodeError: 'utf8' codec can't decode byte 0xa0 in position 6: unexpected code byte

>>> "ILLEGA\xa0L".decode("utf8","ignore")
u'ILLEGAL'
>>>

Declare the coding on the second line of your script. It really has to be second. Like

#!/usr/bin/python
# coding=utf-8

This might be enough to solve your problem all by itself. If not, see str.encode('utf-8') and str.decode('utf-8').

Tags:

Python

Unicode