Possible to share in-memory data between 2 separate processes?

Without some deep and dark rewriting of the Python core runtime (to allow forcing of an allocator that uses a given segment of shared memory and ensures compatible addresses between disparate processes) there is no way to "share objects in memory" in any general sense. That list will hold a million addresses of tuples, each tuple made up of addresses of all of its items, and each of these addresses will have be assigned by pymalloc in a way that inevitably varies among processes and spreads all over the heap.

On just about every system except Windows, it's possible to spawn a subprocess that has essentially read-only access to objects in the parent process's space... as long as the parent process doesn't alter those objects, either. That's obtained with a call to os.fork(), that in practice "snapshots" all of the memory space of the current process and starts another simultaneous process on the copy/snapshot. On all modern operating systems, this is actually very fast thanks to a "copy on write" approach: the pages of virtual memory that are not altered by either process after the fork are not really copied (access to the same pages is instead shared); as soon as either process modifies any bit in a previously shared page, poof, that page is copied, and the page table modified, so the modifying process now has its own copy while the other process still sees the original one.

This extremely limited form of sharing can still be a lifesaver in some cases (although it's extremely limited: remember for example that adding a reference to a shared object counts as "altering" that object, due to reference counts, and so will force a page copy!)... except on Windows, of course, where it's not available. With this single exception (which I don't think will cover your use case), sharing of object graphs that include references/pointers to other objects is basically unfeasible -- and just about any objects set of interest in modern languages (including Python) falls under this classification.

In extreme (but sufficiently simple) cases one can obtain sharing by renouncing the native memory representation of such object graphs. For example, a list of a million tuples each with sixteen floats could actually be represented as a single block of 128 MB of shared memory -- all the 16M floats in double-precision IEEE representation laid end to end -- with a little shim on top to "make it look like" you're addressing things in the normal way (and, of course, the not-so-little-after-all shim would also have to take care of the extremely hairy inter-process synchronization problems that are certain to arise;-). It only gets hairier and more complicated from there.

Modern approaches to concurrency are more and more disdaining shared-anything approaches in favor of shared-nothing ones, where tasks communicate by message passing (even in multi-core systems using threading and shared address spaces, the synchronization issues and the performance hits the HW incurs in terms of caching, pipeline stalls, etc, when large areas of memory are actively modified by multiple cores at once, are pushing people away).

For example, the multiprocessing module in Python's standard library relies mostly on pickling and sending objects back and forth, not on sharing memory (surely not in a R/W way!-).

I realize this is not welcome news to the OP, but if he does need to put multiple processors to work, he'd better think in terms of having anything they must share reside in places where they can be accessed and modified by message passing -- a database, a memcache cluster, a dedicated process that does nothing but keep those data in memory and send and receive them on request, and other such message-passing-centric architectures.


mmap.mmap(0, 65536, 'GlobalSharedMemory')

I think the tag ("GlobalSharedMemory") must be the same for all processes wishing to share the same memory.

http://docs.python.org/library/mmap.html

Tags:

Python