Python tips for memory optimization

I've been in situations with a collection of large objects that I needed to sort and filter by different methods based on several metadata properties. Since I didn't need the larger parts of them, I dumped those to disk.

As your data is so simple in type, a quick SQLite database might solve all your problems, and even speed things up a little.
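
A minimal sketch of that idea (the kv table and column names are made up for illustration; use a filename instead of :memory: to share the data between processes):

import sqlite3

# In-memory database for the example; a file path persists it to disk.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE kv (key TEXT PRIMARY KEY, value INTEGER)")

conn.executemany("INSERT INTO kv VALUES (?, ?)", [("alpha", 1), ("beta", 2)])
conn.commit()

row = conn.execute("SELECT value FROM kv WHERE key = ?", ("beta",)).fetchone()
print(row[0])  # 2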


If you want to stay with the in-memory data store, you could try something like memcached.

That way, you can use a single in-memory key/value store from all the Python processes.

There are several Python memcached client libraries.
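
For example, with pymemcache (one of those libraries; this assumes a memcached server is already running on the default port):

from pymemcache.client.base import Client

# Connect to a local memcached instance on the default port 11211.
client = Client(("localhost", 11211))

client.set("some-key", "some-value")
print(client.get("some-key"))  # b'some-value' -- values come back as bytes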


For a web application you should use a database. The way you're doing it, you're creating one copy of your dict for each Apache process, which is extremely wasteful. If you have enough memory on the server, the database table will be cached in memory (if you don't have enough for one copy of your table, put more RAM in the server). Just remember to put the correct indices on your database table, or you will get bad performance.
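
With SQLite, for instance, EXPLAIN QUERY PLAN tells you whether a lookup actually hits an index (the kv table here is the hypothetical one from the sketch above):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE kv (key TEXT, value INTEGER)")
conn.execute("CREATE INDEX idx_kv_key ON kv (key)")

# With the index the plan reports a SEARCH using idx_kv_key;
# drop the index and it becomes a SCAN, i.e. a full table scan per lookup.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT value FROM kv WHERE key = ?", ("x",)
).fetchall()
print(plan)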


I suggest the following: store all the values in a DB, and keep an in-memory dictionary with string hashes as keys. If a collision occurs, fetch the values from the DB; otherwise (in the vast majority of cases), use the dictionary. Effectively, it will be a giant cache.
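
A rough sketch of that scheme, with zlib.crc32 standing in for the hash and a generic database object assumed (its store/fetch methods are hypothetical names):

import zlib

cache = {}          # hash -> value; the in-memory side
colliding = set()   # hashes that ever received two writes

def key_hash(key):
    return zlib.crc32(key.encode("utf-8"))

def store(database, key, value):
    database.store(key, value)           # the DB always holds the truth
    hk = key_hash(key)
    if hk in cache or hk in colliding:   # hash already taken: ambiguous
        cache.pop(hk, None)
        colliding.add(hk)
    else:
        cache[hk] = value

def fetch(database, key):
    hk = key_hash(key)
    if hk in colliding:                  # rare case: go to the DB
        return database.fetch(key)
    return cache.get(hk)                 # the vast majority of lookups

Note that re-storing the same key also counts as a collision here; that wastes a little cache but keeps lookups correct.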

A problem with dictionaries in Python is that they use a lot of space: even an int-int dictionary uses 45-80 bytes per key-value pair on a 32-bit system. At the same time, an array.array('i') uses only 8 bytes per pair of ints, and with a little bit of bookkeeping one can implement a reasonably fast array-based int → int dictionary.
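
A quick way to see the gap (sys.getsizeof counts only the containers, so the dict figure is an underestimate: it excludes the int objects the dict additionally keeps alive):

import array
import sys

n = 100_000
d = dict(zip(range(n), range(n)))      # n int-int pairs as a dict
a = array.array('i', range(2 * n))     # the same n pairs, interleaved

print(sys.getsizeof(d) / n)   # tens of bytes per pair, for the table alone
print(sys.getsizeof(a) / n)   # ~8 bytes per pair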

Once you have a memory-efficient implementation of an int-int dictionary, split your string → (object, int, int) dictionary into three dictionaries and use hashes instead of full strings. You'll get one int → object and two int → int dictionaries. Emulate the int → object dictionary as follows: keep a list of objects and store indexes of the objects as values of an int → int dictionary.
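
In plain-dict form (ordinary dicts standing in for the array-based int → int dictionaries, key_hash being a stable string hash as in the sketch above, and collision handling left out), the split looks like:

objects = []       # the object store; list positions emulate int -> object
obj_index = {}     # hash -> index into objects
first_int = {}     # hash -> first int of the tuple
second_int = {}    # hash -> second int of the tuple

def put(key, obj, a, b):
    hk = key_hash(key)
    obj_index[hk] = len(objects)
    objects.append(obj)
    first_int[hk] = a
    second_int[hk] = b

def get(key):
    hk = key_hash(key)
    return objects[obj_index[hk]], first_int[hk], second_int[hk]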

I do realize there's a considerable amount of coding involved to get an array-based dictionary. I had a problem similar to yours and implemented a reasonably fast, very memory-efficient, generic hash-int dictionary. Here's my code (BSD license). It is array-based (8 bytes per pair), it takes care of key hashing and collision checking, and it keeps the array (several smaller arrays, actually) ordered during writes and does binary search on reads. Your code is reduced to something like:

dictionary = HashIntDict(checking=HashIntDict.CHK_SHOUTING)
# ...
# Write path: always persist to the database; cache in the dictionary
# unless the key's hash collides with an existing entry.
database.store(k, v)
try:
    dictionary[k] = v
except CollisionError:
    pass
# ...
# Read path: serve from the dictionary, falling back to the database
# on a collision.
try:
    v = dictionary[k]
except CollisionError:
    v = database.fetch(k)

The checking parameter specifies what happens when a collision occurs: CHK_SHOUTING raises CollisionError on reads and writes, CHK_DELETING returns None on reads and remains silent on writes, and CHK_IGNORING doesn't do collision checking at all.

What follows is a brief description of my implementation; optimization hints are welcome! The top-level data structure is a regular dictionary of arrays. Each array contains up to 2^16 = 65536 integer pairs (the square root of 2^32). A key k and its corresponding value v are both stored in the (k // 65536)-th array. The arrays are initialized on demand and kept ordered by keys. Binary search is executed on each read and write. Collision checking is optional. If enabled, an attempt to overwrite an already existing key will remove the key and its associated value from the dictionary, add the key to a set of colliding keys, and (again, optionally) raise an exception.
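
For illustration, here is a toy version of that layout; it uses parallel sorted arrays per bucket instead of interleaved pairs, and omits the collision bookkeeping described above:

import array
import bisect

BUCKET = 2 ** 16  # 65536 pairs per bucket

class TinyIntDict:
    def __init__(self):
        self.keys = {}    # bucket id -> sorted array of keys
        self.vals = {}    # bucket id -> values, parallel to keys

    def __setitem__(self, k, v):
        b = k // BUCKET
        ks = self.keys.setdefault(b, array.array('l'))
        vs = self.vals.setdefault(b, array.array('l'))
        i = bisect.bisect_left(ks, k)     # binary search on every write
        if i < len(ks) and ks[i] == k:
            vs[i] = v                     # overwrite existing key
        else:
            ks.insert(i, k)               # keep the bucket ordered
            vs.insert(i, v)

    def __getitem__(self, k):
        b = k // BUCKET
        if b not in self.keys:
            raise KeyError(k)
        ks = self.keys[b]
        i = bisect.bisect_left(ks, k)     # binary search on every read
        if i == len(ks) or ks[i] != k:
            raise KeyError(k)
        return self.vals[b][i]

d = TinyIntDict()
d[70000] = 42      # lands in bucket 1 (70000 // 65536)
print(d[70000])    # 42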