How do I properly create custom text codecs?

You asked for minimal!

  • Write an encode function and a decode function.
  • Write a "search function" that returns a CodecInfo object constructed from the above encoder and decoder.
  • Use codecs.register to register that search function.

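Here is an example that maps the lowercase letters a-z to the numbers 0-25 in order. Because the numbers are written as decimal strings with no separator, the round trip is only unambiguous for a-j (0-9): b'k' decodes to '10', which would encode back to b'ba'.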

import codecs
import string

from typing import Tuple

# prepare map from numbers to letters
_encode_table = {str(number): bytes(letter, 'ascii') for number, letter in enumerate(string.ascii_lowercase)}

# prepare inverse map
_decode_table = {ord(v): k for k, v in _encode_table.items()}


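def custom_encode(text: str, errors: str = 'strict') -> Tuple[bytes, int]:
    # example encoder that converts digit characters to letters
    # `errors` is accepted because callers may pass an error handler; it is ignored here
    # see https://docs.python.org/3/library/codecs.html#codecs.Codec.encode
    return b''.join(_encode_table[x] for x in text), len(text)


def custom_decode(binary: bytes, errors: str = 'strict') -> Tuple[str, int]:
    # example decoder that converts letters to digit characters
    # `errors` is accepted because callers may pass an error handler; it is ignored here
    # see https://docs.python.org/3/library/codecs.html#codecs.Codec.decode
    return ''.join(_decode_table[x] for x in binary), len(binary)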


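def custom_search_function(encoding_name):
    # the lookup machinery passes the encoding name in lower case;
    # return None for any name this codec does not handle
    if encoding_name != 'reasons':
        return None
    return codecs.CodecInfo(custom_encode, custom_decode, name='Reasons')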


def main():

    # register the search function for your custom codec
    # note that the name given as encoding= below is passed to it (in lower case)
    codecs.register(custom_search_function)

    binary = b'abcdefg'
    # decode letters to numbers
    text = codecs.decode(binary, encoding='Reasons')
    print(text)
    # encode numbers to letters
    binary2 = codecs.encode(text, encoding='Reasons')
    print(binary2)
    # encode(decode(...)) should be an identity function
    assert binary == binary2

if __name__ == '__main__':
    main()

Running this prints

$ python codec_example.py
0123456
b'abcdefg'

See https://docs.python.org/3/library/codecs.html#codec-objects for details on the Codec interface. In particular, the decode function

... decodes the object input and returns a tuple (output object, length consumed).

whereas the encode function

... encodes the object input and returns a tuple (output object, length consumed).
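To make the (output object, length consumed) tuples concrete, here is a small sketch that calls the raw encode and decode functions of a built-in codec (utf-8 is used only because it is always available; it is not part of the custom codec above). Note that "length consumed" counts input units: characters when encoding, bytes when decoding.

```python
import codecs

# Look up a built-in codec so we can call its raw encode/decode functions,
# which return (output, length consumed) tuples.
info = codecs.lookup('utf-8')

# Encoding consumes characters: 'héllo' has 5 of them.
encoded, chars_consumed = info.encode('héllo')

# Decoding consumes bytes: 'é' takes two bytes in UTF-8, so 6 in total.
decoded, bytes_consumed = info.decode(encoded)

print(encoded, chars_consumed)
print(decoded, bytes_consumed)
```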

Note that you should also worry about handling streams, incremental encoding/decoding, as well as error handling. For a more complete example, refer to the hexlify codec that @krs013 mentioned.
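As a rough illustration of the error-handling part only (not streams or incremental coding), here is a sketch of a codec that honors the errors argument. Everything in it is an assumption made for the example: the codec name 'digits_demo', the ten-symbol table, and the choice to support just 'strict', 'ignore' and 'replace'. A fuller implementation would probably delegate to codecs.lookup_error instead.

```python
import codecs
from typing import Tuple

# Hypothetical ten-symbol table: '0'..'9' <-> b'a'..b'j'
_ENCODE = {str(i): bytes([ord('a') + i]) for i in range(10)}
_DECODE = {ord('a') + i: str(i) for i in range(10)}


def demo_encode(text: str, errors: str = 'strict') -> Tuple[bytes, int]:
    out = []
    for pos, ch in enumerate(text):
        if ch in _ENCODE:
            out.append(_ENCODE[ch])
        elif errors == 'ignore':
            continue                      # silently drop the bad character
        elif errors == 'replace':
            out.append(b'?')              # substitute a marker byte
        else:
            raise UnicodeEncodeError('digits_demo', text, pos, pos + 1,
                                     'not a digit')
    return b''.join(out), len(text)


def demo_decode(binary: bytes, errors: str = 'strict') -> Tuple[str, int]:
    # bytes.decode() may hand us a memoryview, so normalize to bytes first
    binary = bytes(binary)
    out = []
    for pos, byte in enumerate(binary):
        if byte in _DECODE:
            out.append(_DECODE[byte])
        elif errors == 'ignore':
            continue
        elif errors == 'replace':
            out.append('\ufffd')          # Unicode replacement character
        else:
            raise UnicodeDecodeError('digits_demo', binary, pos, pos + 1,
                                     'byte not in a-j')
    return ''.join(out), len(binary)


def demo_search(name):
    # only answer for our own (hypothetical) codec name
    if name == 'digits_demo':
        return codecs.CodecInfo(demo_encode, demo_decode, name='digits_demo')
    return None


codecs.register(demo_search)

print('12x3'.encode('digits_demo', 'ignore'))   # b'bcd'
print('12x3'.encode('digits_demo', 'replace'))  # b'bc?d'
```

With the default 'strict' handler, '12x3'.encode('digits_demo') raises UnicodeEncodeError, which is what callers of the built-in codecs would expect.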


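P.S. Instead of codecs.decode, you can also use codecs.open(..., encoding='Reasons'), but only if the CodecInfo also provides streamreader and streamwriter classes; the minimal one above does not.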
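codecs.open wraps the file in a StreamReaderWriter built from the codec's stream classes, so those have to exist. Here is a sketch of what that could look like, assuming a made-up codec name 'digits_stream_demo' and a ten-symbol table in which every character maps to exactly one byte (which is what lets the stream classes stay empty). No error handling is attempted.

```python
import codecs
import os
import tempfile

# Hypothetical ten-symbol table: '0'..'9' <-> b'a'..b'j'
_ENCODE = {str(i): bytes([ord('a') + i]) for i in range(10)}
_DECODE = {ord('a') + i: str(i) for i in range(10)}


class DemoCodec(codecs.Codec):
    def encode(self, text, errors='strict'):
        return b''.join(_ENCODE[ch] for ch in text), len(text)

    def decode(self, binary, errors='strict'):
        return ''.join(_DECODE[b] for b in bytes(binary)), len(binary)


# The stream classes pick up encode/decode from DemoCodec via the MRO,
# the same pattern the standard library's own encodings use.
class DemoStreamWriter(DemoCodec, codecs.StreamWriter):
    pass


class DemoStreamReader(DemoCodec, codecs.StreamReader):
    pass


def demo_search(name):
    if name != 'digits_stream_demo':
        return None
    codec = DemoCodec()
    return codecs.CodecInfo(
        codec.encode, codec.decode,
        streamreader=DemoStreamReader, streamwriter=DemoStreamWriter,
        name='digits_stream_demo',
    )


codecs.register(demo_search)

path = os.path.join(tempfile.mkdtemp(), 'demo.bin')

# Write text through the codec, then inspect the raw bytes on disk.
with codecs.open(path, 'w', encoding='digits_stream_demo') as f:
    f.write('0123')
with open(path, 'rb') as f:
    raw = f.read()

# Read the same file back through the codec.
with codecs.open(path, 'r', encoding='digits_stream_demo') as f:
    text = f.read()

print(raw, text)
```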


While the online documentation is certainly sparse, you can get a lot more information by looking at the source code of the codecs module (Lib/codecs.py). The docstrings and comments are quite clear, and the definitions for the parent classes (Codec, IncrementalEncoder, etc.) are ready to be copy/pasted as a starting point for your codec (be sure to replace the object in each class definition with the name of the class you're inheriting from). It's also worth looking at the example I linked to in the comments for how to assemble and register it.
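As a sketch of the class-based approach described above, here is one way the Codec, IncrementalEncoder and IncrementalDecoder subclasses might fit together. The codec name 'digits_incremental_demo' and the ten-symbol table are invented for the example, and because each character maps to exactly one byte, the incremental classes never need to buffer anything. A codec with multi-byte sequences would have to keep state between calls.

```python
import codecs

# Hypothetical ten-symbol table: '0'..'9' <-> b'a'..b'j'
_ENCODE = {str(i): bytes([ord('a') + i]) for i in range(10)}
_DECODE = {ord('a') + i: str(i) for i in range(10)}


class DemoCodec(codecs.Codec):
    def encode(self, text, errors='strict'):
        return b''.join(_ENCODE[ch] for ch in text), len(text)

    def decode(self, binary, errors='strict'):
        return ''.join(_DECODE[b] for b in bytes(binary)), len(binary)


class DemoIncrementalEncoder(codecs.IncrementalEncoder):
    def encode(self, text, final=False):
        # one character -> one byte, so nothing carries over between calls
        return DemoCodec().encode(text, self.errors)[0]


class DemoIncrementalDecoder(codecs.IncrementalDecoder):
    def decode(self, binary, final=False):
        # one byte -> one character, so no partial input needs buffering
        return DemoCodec().decode(binary, self.errors)[0]


def demo_search(name):
    if name != 'digits_incremental_demo':
        return None
    codec = DemoCodec()
    return codecs.CodecInfo(
        codec.encode, codec.decode,
        incrementalencoder=DemoIncrementalEncoder,
        incrementaldecoder=DemoIncrementalDecoder,
        name='digits_incremental_demo',
    )


codecs.register(demo_search)

# iterencode/iterdecode drive the incremental classes chunk by chunk.
chunks = list(codecs.iterencode(['01', '23'], 'digits_incremental_demo'))
text = ''.join(codecs.iterdecode([b'ab', b'cd'], 'digits_incremental_demo'))

print(chunks, text)
```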

I've been stuck at the same point as you for a while looking through this, so good luck! If I have time in a few days, I'll see about actually making that implementation and pasting/linking to it here.