Type Annotating Pandas DataFrames

Where it's reasonable, I document a dataframe's shape and columns in the docstring. Sometimes it's not reasonable.

:param dataframe: pd.DataFrame [M x (3+N)]
    'id': int
        ID column
    'value': int
        Number of things
    'color': str
        Color of things
    Remaining columns are properties; all should be float64s

There's probably a better way to do this, but I haven't found it.
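To show where such a description sits, here is a minimal sketch of a complete function carrying the docstring above. The function name `count_things_by_color`, the `:return:` line, and the `weight` property column are made up for illustration; only the `:param dataframe:` block comes from the answer.

```python
import pandas as pd


def count_things_by_color(dataframe: pd.DataFrame) -> pd.Series:
    """Sum the number of things for each color.

    :param dataframe: pd.DataFrame [M x (3+N)]
        'id': int
            ID column
        'value': int
            Number of things
        'color': str
            Color of things
        Remaining columns are properties; all should be float64s
    :return: pd.Series indexed by color, holding the summed 'value'
    """
    return dataframe.groupby("color")["value"].sum()


# Example frame matching the documented layout, with one property column.
things = pd.DataFrame({
    "id": [1, 2, 3],
    "value": [10, 5, 7],
    "color": ["red", "blue", "red"],
    "weight": [0.5, 1.25, 2.0],  # hypothetical property column (float64)
})
totals = count_things_by_color(things)
```

The type hint only says `pd.DataFrame`; the column layout lives entirely in the docstring, which is the point of the convention.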

I have tried @Xukrao's method. Having a summary table is really nice.

Also, inspired by another question on Stack Overflow: the csv-table block is more convenient to modify, because you don't have to worry about column alignment or the "=" border lines. For example:

intra_edges (DataFrame): correspondence between intra-edges in
    planar graph and in multilayer graph.

    .. csv-table::
        :header: name, dtype, definition

        source_original (index), object, source in planar graph
        target_original (index), object, target in planar graph
        source, object, current source bus
        target, object, current target bus

inter_edges (DataFrame): correspondence between inter-nodes in
    planar graph and inter-edges in multilayer graph.

    ======  =======  ============================  ==========
    name    dtype    definition                    is_index
    ======  =======  ============================  ==========
    node    object   name in planar graph          True
    upper   int64    integer index of upper layer  False
    lower   int64    integer index of lower layer  False
    source  object   source node in supra graph    False
    target  object   target node in supra graph    False
    ======  =======  ============================  ==========
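Since csv-table rows need no alignment, they are also easy to generate. The helper below is a rough sketch, not part of the original answer: `csv_table_block` is a hypothetical name, and it only fills in the name and dtype columns, leaving `TODO` where the definition has to be written by hand.

```python
import pandas as pd


def csv_table_block(df: pd.DataFrame, indent: str = "    ") -> str:
    """Build a csv-table directive listing the name and dtype of each column.

    Named index levels are listed first and marked with "(index)".
    The definition column is a placeholder to be edited by hand.
    """
    lines = [
        f"{indent}.. csv-table::",
        f"{indent}    :header: name, dtype, definition",
        "",
    ]
    # Index levels first; unnamed levels are skipped.
    for name in df.index.names:
        if name is not None:
            dtype = df.index.get_level_values(name).dtype
            lines.append(f"{indent}    {name} (index), {dtype}, TODO")
    # Then the regular columns, in frame order.
    for name, dtype in df.dtypes.items():
        lines.append(f"{indent}    {name}, {dtype}, TODO")
    return "\n".join(lines)


# Small frame loosely modelled on inter_edges above.
inter_edges = pd.DataFrame(
    {"upper": [1], "lower": [0]},
    index=pd.Index(["a"], name="node"),
    dtype="int64",
)
block = csv_table_block(inter_edges)
print(block)
```

The printed text can be pasted under the parameter description and the `TODO` entries replaced with real definitions.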

Docstring format

I use the numpy docstring convention as a basis. If a function's input parameter or return value is a pandas dataframe with predetermined columns, I add a reStructuredText-style table with column descriptions to the parameter description. As an example:

import numpy as np
import pandas as pd


def random_dataframe(no_rows):
    """Return dataframe with random data.

    Parameters
    ----------
    no_rows : int
        Desired number of data rows.

    Returns
    -------
    pd.DataFrame
        Dataframe with randomly selected values. Data columns are as follows:

        ==========  ==============================================================
        rand_int    randomly chosen whole numbers (as `int`)
        rand_float  randomly chosen numbers with decimal parts (as `float`)
        rand_color  randomly chosen colors (as `str`)
        rand_bird   randomly chosen birds (as `str`)
        ==========  ==============================================================

    """
    df = pd.DataFrame({
        "rand_int": np.random.randint(0, 100, no_rows),
        "rand_float": np.random.rand(no_rows),
        "rand_color": np.random.choice(['green', 'red', 'blue', 'yellow'], no_rows),
        "rand_bird": np.random.choice(['kiwi', 'duck', 'owl', 'parrot'], no_rows),
    })

    return df

Bonus: sphinx compatibility

The aforementioned docstring format is compatible with the sphinx autodoc documentation generator. This is what the docstring looks like in HTML documentation that was automatically generated by sphinx (using the nature theme):

[image: sphinx docstring]
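For reference, a `conf.py` fragment along these lines should be enough to render such docstrings. This is a sketch under assumptions: the answer only mentions autodoc and the nature theme, so the use of `sphinx.ext.napoleon` to parse the numpy-style sections is a guess (the numpydoc extension is another option).

```python
# conf.py (fragment): a sketch, not a complete Sphinx configuration

extensions = [
    "sphinx.ext.autodoc",   # pulls docstrings out of the Python modules
    "sphinx.ext.napoleon",  # assumption: parses the numpy-style sections
]

html_theme = "nature"  # built-in theme named in the text above
```

The table inside the docstring is plain reStructuredText, so it needs no extra extension of its own.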
