Dictionary size reduces upon increasing one element

sys.getsizeof returns the memory allocated to the underlying hash table implementation of those dictionaries, which has a somewhat non-obvious relationship with the actual size of the dictionary.

The CPython implementation of Python 2.7 quadruples the amount of memory allocated to a hash table each time it is filled up to 2/3 of its capacity, but shrinks it if the table is over-allocated (i.e. a large contiguous block of memory has been allocated but only a few slots were actually used).

It just so happens that dictionaries that have between 8 and 11 elements allocate just enough memory for CPython to consider them 'over-allocated', and get shrunk.
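You can reproduce the observation with two dict literals that differ by a single element. The exact byte counts, and whether the drop shows up at all, depend on the interpreter version and build, so I deliberately don't list them:

```python
import sys

# On CPython 2.7 the 8-key literal is pre-sized to a 16-slot table, while
# the 7-key literal grows incrementally to 32 slots, so getsizeof can report
# a *smaller* number for the bigger dict. Other versions allocate
# differently, so treat the printed numbers as illustrative only.
d7 = {'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 5, 'f': 6, 'g': 7}
d8 = {'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 5, 'f': 6, 'g': 7, 'h': 8}

print(len(d7), sys.getsizeof(d7))
print(len(d8), sys.getsizeof(d8))
```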


Previous answers have already mentioned that you needn't worry, so I will dive into some more technical details. It's long, but please bear with me.

TLDR: this comes down to the arithmetic of resizing. Each resize allocates 2**i slots, where 2**i > requested_size and 2**i >= 8; on top of that, each insert resizes the underlying table again if 2/3 or more of the slots are filled, this time requesting 4 * used slots (again rounded up to a power of two). This way, your first dictionary ends up with 32 cells allocated, while the second one has as few as 16 (as it got a bigger initial size upfront).
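That arithmetic can be simulated in a few lines of plain Python. This is a sketch of the rules described below, not CPython's actual code, and the helper names are my own:

```python
def next_table_size(minused, minsize=8):
    """Smallest power of two strictly greater than minused, at least 8
    (PyDict_MINSIZE) -- mirrors the sizing loop in Python 2.7's dictresize."""
    size = minsize
    while size <= minused:
        size <<= 1
    return size

def slots_after_literal(n_keys):
    """Simulate building an n_keys dict literal: pre-size the table (only
    when there are more than 5 keys), then insert keys one by one, growing
    to 4 * used slots whenever 2/3 or more of the table is filled."""
    size = next_table_size(n_keys) if n_keys > 5 else 8
    used = 0
    for _ in range(n_keys):
        used += 1
        if used * 3 >= size * 2:       # the 2/3 fill threshold
            size = next_table_size(4 * used)
    return size

print(slots_after_literal(7))  # -> 32
print(slots_after_literal(8))  # -> 16
```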

Answer: as @snakecharmerb noted in the comments, this depends on the way the dictionary is created. For the sake of brevity, let me refer you to this excellent blog post, which explains the differences between the dict() constructor and the dict literal {} at both the Python bytecode and CPython implementation levels.

Let's start with the magic number of 8 keys. It turns out to be a constant, predefined in Python 2.7's dictobject.h header file: the minimal size of a Python dictionary:

/* PyDict_MINSIZE is the minimum size of a dictionary.  This many slots are
 * allocated directly in the dict object (in the ma_smalltable member).
 * It must be a power of 2, and at least 4.  8 allows dicts with no more
 * than 5 active entries to live in ma_smalltable (and so avoid an
 * additional malloc); instrumentation suggested this suffices for the
 * majority of dicts (consisting mostly of usually-small instance dicts and
 * usually-small dicts created to pass keyword arguments).
 */
#define PyDict_MINSIZE 8

As such, it may differ between specific Python implementations, but let's assume we are all using the same CPython version. However, a dict of size 8 is expected to neatly contain only 5 elements; don't worry about this, as this specific optimization is not as important to us as it seems.

Now, when you create a dictionary using the dict literal {}, CPython takes a shortcut (compared to the explicit creation via the dict constructor). Simplifying a bit: the bytecode operation BUILD_MAP gets resolved by calling the _PyDict_NewPresized function, which constructs a dictionary whose size we already know upfront:

/* Create a new dictionary pre-sized to hold an estimated number of elements.
   Underestimates are okay because the dictionary will resize as necessary.
   Overestimates just mean the dictionary will be more sparse than usual.
*/

PyObject *
_PyDict_NewPresized(Py_ssize_t minused)
{
    PyObject *op = PyDict_New();

    if (minused>5 && op != NULL && dictresize((PyDictObject *)op, minused) == -1) {
        Py_DECREF(op);
        return NULL;
    }
    return op;
}

This function calls the normal dict constructor (PyDict_New) and requests a resize of the newly created dict, but only if it is expected to hold more than 5 elements. This is due to an optimization which allows Python to speed things up by holding the data in the pre-allocated "smalltable", without invoking expensive memory allocation and de-allocation functions.

Then, dictresize tries to determine the minimal size of the new dictionary. It also uses the magic number 8 as the starting point and iteratively multiplies it by 2 until it finds the smallest size larger than the requested one. For the first dictionary this is simply 8; for the second one (and for all dictionaries created by a dict literal with between 6 and 15 keys) it is 16.

Now, in the dictresize function there is a special case for the smaller new_size == 8, which is meant to bring forward the aforementioned optimization (using the "small table" to reduce memory manipulation operations). However, because there is no need to resize the newly created dict (e.g. no elements have been removed so far, so the table is "clean"), nothing really happens.

In contrast, when new_size != 8, the usual procedure of reallocating the hash table follows. This ends up with a new table being allocated to store the "big" dictionary. While this is intuitive (the bigger dict gets a bigger table), it does not seem to move us toward the observed behavior yet - but please bear with me one more moment.

Once we have the pre-allocated dict, STORE_MAP opcodes tell the interpreter to insert consecutive key-value pairs. This is implemented in the dict_set_item_by_hash_or_entry function, which - importantly - resizes the dictionary after each increase in size (i.e. each successful insertion) if 2/3 or more of the slots are already used up. The new size is 4x the number of used slots (only 2x for large dicts, with over 50,000 entries), rounded up to a power of two.
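That post-insert check can be sketched like this, under my reading of Python 2.7's insertion path (the function name is mine; in CPython the request and the power-of-two rounding are split between the insert code and dictresize):

```python
def size_after_insert(used, size):
    """Return the table size after an insertion left `used` entries in a
    `size`-slot table. Grows when 2/3 or more of the slots are filled:
    the request is 4 * used slots (2 * used for dicts over 50,000 entries),
    which dictresize rounds up to the next power of two."""
    if used * 3 >= size * 2:
        requested = (2 if used > 50000 else 4) * used
        size = 8                       # PyDict_MINSIZE
        while size <= requested:
            size <<= 1
    return size

print(size_after_insert(5, 8))   # -> 8, still below the 2/3 threshold
print(size_after_insert(6, 8))   # -> 32, a request for 24 rounded up
print(size_after_insert(8, 16))  # -> 16, as 8/16 = 0.5 < 2/3
```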

So here is what happens when you create the dict with 7 elements:

# note 2/3 = 0.(6)
BUILD_MAP   # initial_size = 8, filled = 0
STORE_MAP   # 'key_1' ratio_filled = 1/8 = 0.125, not resizing
STORE_MAP   # 'key_2' ratio_filled = 2/8 = 0.250, not resizing
STORE_MAP   # 'key_3' ratio_filled = 3/8 = 0.375, not resizing
STORE_MAP   # 'key_4' ratio_filled = 4/8 = 0.500, not resizing
STORE_MAP   # 'key_5' ratio_filled = 5/8 = 0.625, not resizing
STORE_MAP   # 'key_6' ratio_filled = 6/8 = 0.750, RESIZING! requested = 6*4 = 24, new_size = 32
STORE_MAP   # 'key_7' ratio_filled = 7/32 = 0.21875

And you end up with a dict whose hash table has 32 slots in total.

However, when adding eight elements, the initial size is twice as big (16), so we never resize, as the condition ratio_filled >= 2/3 is never satisfied!
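For completeness, here is the same trace when creating the dict with 8 elements:

BUILD_MAP   # initial_size = 16 (pre-sized, since 8 keys > 5), filled = 0
STORE_MAP   # 'key_1' ratio_filled = 1/16 = 0.0625, not resizing
STORE_MAP   # 'key_2' ratio_filled = 2/16 = 0.1250, not resizing
STORE_MAP   # 'key_3' ratio_filled = 3/16 = 0.1875, not resizing
STORE_MAP   # 'key_4' ratio_filled = 4/16 = 0.2500, not resizing
STORE_MAP   # 'key_5' ratio_filled = 5/16 = 0.3125, not resizing
STORE_MAP   # 'key_6' ratio_filled = 6/16 = 0.3750, not resizing
STORE_MAP   # 'key_7' ratio_filled = 7/16 = 0.4375, not resizing
STORE_MAP   # 'key_8' ratio_filled = 8/16 = 0.5000, not resizing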

And that's why you end up with a smaller table in the second case.