Wrapping a python class around JSON data, which is better?

(note: this got an update, I'm now using dataclasses with run-time type enforcement. see bottom :3)

So, it's been a year and I'm going to answer my own question. I don't quite like answering it myself, but: this will mark the thread as resolved, which in itself might help others.

On the other hand, I want to document and give reasons for why I chose my solution over the proposed answers; not to prove myself right, but to highlight the different trade-offs.

I just realized that this got quite long, so:

tl;dr

collections.abc contains powerful abstractions and you should use them if you have access to them (CPython >= 3.3). @property is nice to use, makes it easy to add documentation, and provides read-only access. Nested classes look weird but replicate the structure of deeply nested JSON just fine.

Proposed solutions

python meta-classes

So first off: I love the concept. I've considered many applications where they prove useful, especially when:

  1. writing a pluggable API where meta-classes enforce correct usage of derived classes and their implementation specifics
  2. having a fully automated registry of classes that derive from a meta-class (a sketch of that idea follows below).
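
A minimal sketch of that registry idea, purely illustrative and not code from the project:

class Registry(type):
    classes = {}  # class name -> class, filled automatically

    def __new__(mcs, name, bases, namespace):
        cls = super().__new__(mcs, name, bases, namespace)
        mcs.classes[name] = cls  # every class created through this meta-class registers itself
        return cls

class PluginBase(metaclass=Registry):
    pass

class MyPlugin(PluginBase):  # registered at class creation, no extra code needed
    pass

assert Registry.classes['MyPlugin'] is MyPlugin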

On the other hand, python's meta-class logic was hard to wrap my head around (it took me at least three days to figure out). While simple in principle, the devil is in the details. So I decided against it, simply because I might abandon the project in the not-so-far future and others should be able to pick up where I left off easily.

namedtuple

collections.namedtuple is very efficient and concise enough to boil my solution down to several lines instead of the current 800+ lines. My IDE will also be able to introspect possible members of the generated class.

Cons: the brevity of namedtuple leaves much less room for the sorely needed documentation of the API's return values. So with a less insane API you might well get away with just that. It also feels weird to nest class objects into the namedtuple, but that's just personal preference.
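
For comparison, the namedtuple approach might look roughly like this; the field names are illustrative, not the project's actual code:

from collections import namedtuple

Course = namedtuple('Course', ['id', 'short_name', 'full_name', 'enrolled_user_count'])

# Course(**data) maps dict keys to fields and raises TypeError on any mismatch:
course = Course(**{'id': 1, 'short_name': 'CS101',
                   'full_name': 'Introduction to CS', 'enrolled_user_count': 42})
print(course.short_name)  # read-only attribute access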

What I went with

So in the end, I chose to stick with my original solution, with a few minor details added. If you find the details interesting, you can look at the source on GitHub.

collections.abc

When I started the project, my python knowledge was next to none, so I went with what I knew about python ("everything is a dict") and wrote code like that. For example: classes that work like a dict, but have a file structure underneath (that was before pathlib).

While looking through Python's source code, I noticed how it implements and enforces container "traits" through abstract base classes, which sounds far more complicated than it really is.

the very basics

The following is indeed very basic, but we'll build up from there.

from collections.abc import Mapping, Sequence, Sized

class JsonWrapper(Sized):
    def __init__(self, json):
        self._data = json

    def __len__(self):
        return len(self._data)

    @property
    def raw(self): return self._data  # no setter, so access stays read-only

This is the most basic class I could come up with; it just enables you to call len on the container. You can also get read-only access through raw if you really want to bother with the underlying dictionary.

So why am I inheriting from Sized instead of just starting from scratch and defining __len__ just like that?

  1. Not overriding __len__ means the class cannot be instantiated at all: Python raises a TypeError the moment you try to create an instance, so the mistake surfaces early instead of deep inside some longer-running process (see the snippet right after this list).
  2. While Sized does not provide any mixin methods, the next two abstractions do provide them. I'll explain there.
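
To illustrate the first point:

class Broken(Sized):
    pass  # forgot to override __len__

Broken()  # TypeError: Can't instantiate abstract class Broken with abstract method __len__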

With that down, we've only got two more basic cases to cover in JSON: lists and dicts.

Lists

So, with the API I had to worry about, we were not always sure what we would get; so I wanted a way of checking whether I got a list when initializing the wrapper class, mostly to abort early instead of hitting "object has no member" during more complicated processing.

Deriving from Sequence will enforce overriding __getitem__ and __len__ (which is already implemented in JsonWrapper).

class JsonListWrapper(JsonWrapper, Sequence):
    def __init__(self, json_list):
        if type(json_list) is not list:
            raise TypeError('received type {}, expected list'.format(type(json_list)))
        super().__init__(json_list)

    def __getitem__(self, index):
        return self._data[index]

    def __iter__(self):
        # deliberately not implemented, see below
        raise NotImplementedError('__iter__')

    def get(self, index):
        try:
            return self._data[index]
        except Exception:
            print(index)  # crude debugging aid: show the offending index
            raise

You might have noticed that I chose not to implement __iter__. I wanted an iterator that yields typed objects, so my IDE is able to autocomplete. To illustrate:

class CourseListResponse(JsonListWrapper):
    def __iter__(self):
        for course in self._data:
            yield self.Course(course)

    class Course(JsonDictWrapper):
        pass  # for now

By implementing the abstract methods of Sequence, the mixin methods __contains__, __reversed__, index and count are gifted to you, so you don't have to implement, and possibly get wrong, any of them yourself.
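
To make that concrete, a small usage sketch with made-up data:

courses = CourseListResponse([{'id': 1}, {'id': 2}])
print(len(courses))              # 2, via JsonWrapper
print(courses.index({'id': 2}))  # 1, from the index mixin (built on __getitem__)
print({'id': 1} in courses)      # True, from the __contains__ mixin (built on __iter__)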

Dictionaries

To complete the basic types to wrangle JSON, here's the class derived from Mapping:

class JsonDictWrapper(JsonWrapper, Mapping):
    def __init__(self, json_dict):
        super().__init__(json_dict)
        if type(self._data) is not dict:
            raise TypeError('received type {}, expected dict'.format(type(json_dict)))

    def __iter__(self):
        return iter(self._data)

    def __getitem__(self, key):
        return self._data[key]

    # sentinel: lets get() distinguish "no default given" from default=None
    __marker = object()

    def get(self, key, default=__marker):
        try:
            return self._data[key]
        except KeyError:
            if default is self.__marker:
                raise  # no default was passed, so propagate the KeyError
            return default

Mapping only enforces __iter__, __getitem__ and __len__. To avoid confusion: there is also MutableMapping, which additionally enforces the writing methods; but that's neither needed nor wanted here.

With the abstract methods out of the way, python provides the mixins __contains__, keys, items, values, get, __eq__, and __ne__ based on them.
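
So with nothing but the three overrides in place, a wrapper already behaves like a read-only dict (toy data again):

course = JsonDictWrapper({'id': 1, 'short_name': 'CS101'})
print(list(course.keys()))  # ['id', 'short_name'], from the keys mixin
print('id' in course)       # True, from the __contains__ mixin
print(course == {'id': 1, 'short_name': 'CS101'})  # True, from the __eq__ mixin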

I'm not sure anymore why I chose to override the get mixin; I might update the post when it comes back to me. __marker serves as a sentinel to detect whether the default keyword was set at all: if somebody decided to call get(key, default=None), you couldn't otherwise distinguish that from no default being passed.
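
In practice, the sentinel makes these three calls behave differently:

course = JsonDictWrapper({'id': 1})
course.get('id')             # 1
course.get('missing', None)  # None: the caller explicitly asked for a default
course.get('missing')        # raises KeyError, since no default was passed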

So to pick up the previous example:

class CourseListResponse(JsonListWrapper):
    # [...]    
    class Course(JsonDictWrapper):
        # Jn is just a class that contains the keys for JSON, so I only mistype once.
        @property
        def id(self): return self[Jn.id]

        @property
        def short_name(self): return self[Jn.short_name]

        @property
        def full_name(self): return self[Jn.full_name]

        @property
        def enrolled_user_count(self): return self[Jn.enrolled_user_count]
        # [...] you get the idea

The properties provide read-only access to members and can be documented like a function definition. Although verbose, for basic accessors you can easily define a template in your editor, so they're less tedious to write.

Properties also make it possible to abstract away magic numbers and optional JSON return values, providing defaults instead of guarding against KeyError everywhere:

        @property
        def isdir(self): return 1 == self[Jn.is_dir]

        @property
        def time_created(self): return self.get(Jn.time_created, 0)

        @property
        def file_size(self): return self.get(Jn.file_size, -1)

        @property
        def author(self): return self.get(Jn.author, "")

        @property
        def license(self): return self.get(Jn.license, "")

class nesting

It seems a little weird to nest classes inside others. I chose to do that because the API uses the same name for various objects with different attributes, depending on which remote function you called.

Another benefit: new people can easily understand the structure of the returned JSON.

The end of the file contains various aliases to the nested classes for easier access from outside the module.
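
Such an alias is just a module-level assignment; the names here are illustrative:

# at the end of the module
Course = CourseListResponse.Course
# callers can write wrappers.Course instead of wrappers.CourseListResponse.Course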

adding logic

Now that we have encapsulated most of the returned values, I wanted to have more logic associated with the data, to add some convenience. It also seemed necessary to merge some of the data into a more comprehensive tree that contained all of the data gathered through several API calls:

  1. get all "assignments". each assignment contains many submissions, so:
  2. for (assignment in assignments), get all "submissions"
  3. merge the submissions into their respective assignment.
  4. now get grades for the submissions, and so on... (sketched in code right after this list)
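
Sketched as code, the merge flow reads roughly like this; the fetch_* functions are hypothetical stand-ins for the actual API calls:

assignments = [Assignment(data, course=course)
               for data in fetch_assignments(course.id)]
for assignment in assignments:
    assignment.submissions = fetch_submissions(assignment.id)  # the setter merges
    assignment.grades = fetch_grades(assignment.id)            # likewise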

I chose to implement them separately, so I just inherited from the "dumb" accessors (full source):

So in this class

class Assignment(MoodleAssignment):
    def __init__(self, data, course=None):
        super().__init__(data)
        self.course = course
        self._submissions = {}  # accessed via submission.id
        self._grades = {}  # are accessed via user_id

these properties do the merging

    @property
    def submissions(self): return self._submissions

    @submissions.setter
    def submissions(self, data):
        if data is None:
            self._submissions = {}  # assign the backing field, not the property
            return
        for submission in data:
            sub = Submission(submission, assignment=self)
            if sub.has_content:
                self._submissions[sub.id] = sub

    @property
    def grades(self):
        return self._grades

    @grades.setter
    def grades(self, data):
        if data is None:
            self._grades = {}  # assign the backing field, not the property
            return
        grades = [Grade(g) for g in data]
        for g in grades:
            self._grades[g.user_id] = g

and these implement some logic that can be abstracted from the data.

    @property
    def is_due(self):
        now = datetime.now()
        return now > self.due_date

    @property
    def due_date(self):
        # super().due_date is the raw unix timestamp from the "dumb" accessor
        return datetime.fromtimestamp(super().due_date)

While the setters obscure the wrangling, they are nice to write and use, so it's just a trade-off.

Caveat: The logic implementation is not quite what I want it to be; there's much interdependence where there should not be. It grew out of me not knowing enough Python to get the abstractions right while still getting things done, so I could do the actual work with the tedium out of the way. Now that I know what could have been done, I look at some of that spaghetti, and well… you know the feeling.

Conclusion

Encapsulating the JSON into classes proved quite useful to me and the project's structure, and I'm quite happy with it. The rest of the project is fine and works, although some parts are just awful :D Thank you all for the feedback, I'll be around for questions and remarks.

update: 2019-05-02

As @RickTeachey points out in the comments, Python's dataclasses (DCs) can be used here as well. I forgot to put an update here, since I had already done that some time ago and extended it with Python's typing functionality :D

Reason for that: I was growing tired of manually checking whether the documentation of the API I was abstracting over was correct, or whether I had gotten my implementation wrong. With dataclasses.fields I'm able to check that the response conforms to my schema; and now I'm able to find changes in the external API much faster, since the assumptions are checked at run-time on instantiation.

DCs provide a __post_init__(self) hook for post-processing once __init__ has completed successfully. Since Python's type hints only exist to inform static checkers, I built a little system that enforces the types on dataclasses in the post-init phase.

Here is the BaseDC, from which all other DCs inherit (abbreviated):

import dataclasses as dc
from dataclasses import dataclass


@dataclass
class BaseDC:
    def _typecheck(self):
        for field in dc.fields(self):
            expected = field.type
            f = getattr(self, field.name)
            actual = type(f)
            if expected is list or expected is dict:
                log.warning(f'untyped list or dict in {self.__class__.__qualname__}: {field.name}')
            if expected is actual:
                continue
            if is_generic(expected):
                # subscripted generics cannot be used with class and instance checks
                self._typecheck_generic(expected, actual)
                continue
            if issubclass(actual, expected):
                continue
            print(f'mismatch {field.name}: should be: {expected}, but is {actual}')
            print(f'offending value: {f}')

    def __post_init__(self):
        # apply the converter stored in the field's metadata, if there is one
        for field in dc.fields(self):
            castfunc = field.metadata.get('castfunc', False)
            if castfunc:
                attr = getattr(self, field.name)
                new = castfunc(attr)
                setattr(self, field.name, new)
        if DEBUG:
            self._typecheck()

Fields have an additional metadata attribute that can store arbitrary information. I'm using it to store functions that convert the response values; but more on that later.

A basic response wrapper looks like this:

from typing import Optional

@dataclass
class DCcore_enrol_get_users_courses(BaseDC):
    id: int  # id of course
    shortname: str  # short name of course
    fullname: str  # long name of course
    enrolledusercount: int  # Number of enrolled users in this course
    idnumber: str  # id number of course
    visible: int  # 1 means visible, 0 means hidden course
    summary: Optional[str] = None  # summary
    summaryformat: Optional[int] = None  # summary format (1 = HTML, 0 = MOODLE, 2 = PLAIN or 4 = MARKDOWN)
    format: Optional[str] = None  # course format: weeks, topics, social, site
    showgrades: Optional[int] = None  # true if grades are shown, otherwise false
    lang: Optional[str] = None  # forced course language
    enablecompletion: Optional[int] = None  # true if completion is enabled, otherwise false
    category: Optional[int] = None  # course category id
    progress: Optional[float] = None  # Progress percentage
    startdate: Optional[int] = None  # Timestamp when the course start
    enddate: Optional[int] = None  # Timestamp when the course end

    def __str__(self): return f'{self.fullname[0:39]:40} id:{self.id:5d} short: {self.shortname}'


core_enrol_get_users_courses = destructuring_list_cast(DCcore_enrol_get_users_courses)

Responses that are just lists were giving me trouble in the beginning, since I could not enforce type checking on them with a plain List[DCcore_enrol_get_users_courses]. This is where destructuring_list_cast solves that problem for me; it is a little more involved. We're entering higher-order function territory:

import typing
from typing import List

T = typing.TypeVar('T')


def destructuring_list_cast(cls: typing.Callable[[dict], T]) -> typing.Callable[[list], List[T]]:
    def cast(data: list) -> List[T]:
        if data is None:
            return []

        if not isinstance(data, list):
            raise SystemExit(f'listcast expects a list, you sent: {type(data)}')
        try:
            # construct one dataclass instance per entry of the response list
            return [cls(**entry) for entry in data]
        except TypeError as err:
            # here is more code that explains errors
            raise SystemExit(f'listcast for class {cls} failed:\n{err}')

    return cast

This expects a Callable that accepts a dict and returns a class instance of type T, which is what you'd expect from a constructor or a factory. It returns a Callable that accepts a list; here, that's cast. return [cls(**entry) for entry in data] does all the work, constructing a list of dataclasses when you call core_enrol_get_users_courses(response.json()). (Throwing SystemExit is not nice, but it's handled in the upper layers, so it works for me; I want it to fail hard and fast.)
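
A minimal, self-contained call looks like this (the field values are made up):

data = [{'id': 1, 'shortname': 'cs1', 'fullname': 'Introduction to CS',
         'enrolledusercount': 42, 'idnumber': '', 'visible': 1}]
courses = core_enrol_get_users_courses(data)
print(courses[0])  # uses the __str__ defined on the dataclass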

Its other use case is defining nested fields, for when the responses are deeply nested. Remember the field.metadata.get('castfunc', False) in the BaseDC? That's where these two shortcuts come in:

# destructured_cast_field
def dcf(cls):
    return dc.field(metadata={'castfunc': destructuring_list_cast(cls)})


# same, but tolerates a missing field by defaulting to an empty list
def optional_dcf(cls):
    return dc.field(metadata={'castfunc': destructuring_list_cast(cls)}, default_factory=list)

These are used in nested cases like this (see bottom):

@dataclass
class core_files_get_files(BaseDC):
    @dataclass
    class parent(BaseDC):
        contextid: int
        # abbrev ...

    @dataclass
    class file(BaseDC):
        contextid: int
        component: str
        timecreated: Optional[int] = None  # Time created
        # abbrev ...

    parents: List[parent] = dcf(parent)
    files: Optional[List[file]] = optional_dcf(file)
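
To wrap up, this is how a nested response gets destructured; the payload is made up and assumes only the abbreviated fields shown above:

payload = {
    'parents': [{'contextid': 1}],
    'files': [{'contextid': 1, 'component': 'mod_folder', 'timecreated': 1556791740}],
}
result = core_files_get_files(**payload)
print(type(result.files[0]))  # the inner dicts were cast to `file` dataclasses by __post_init__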