python pandas dataframe thread safe?

No, pandas is not thread safe. And it's not thread safe in surprising ways.

Can I delete from a pandas DataFrame while another thread is using it?

Fuggedaboutit! Nope. And generally no, not even for GIL-locked Python data structures.

Can I read from a pandas object while someone else is writing to it?
Can I copy a pandas dataframe in my thread, and work on the copy?

Definitely not. There's a long-standing open issue: https://github.com/pandas-dev/pandas/issues/2728

Actually, I think this is pretty reasonable (i.e. expected) behavior. I wouldn't expect to be able to simultaneously write to and read from, or copy, any data structure unless either: i) it had been designed for concurrency, or ii) I have an exclusive lock on that object and all the view objects derived from it (.loc and .iloc are views, and pandas has many others).
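Option (ii) can be sketched with the standard library alone. `Guarded` below is a hypothetical helper, not part of pandas: it hands out the wrapped object only while an exclusive lock is held. The demo wraps a plain dict so that it runs without pandas. The assumption is that you would wrap a DataFrame the same way and never let the frame, or any view derived from it, escape the `with` block.

```python
import threading
from contextlib import contextmanager


class Guarded:
    """Hypothetical helper: expose an object only while holding a lock."""

    def __init__(self, obj):
        self._obj = obj
        # Re-entrant, so nested use within one thread does not deadlock.
        self._lock = threading.RLock()

    @contextmanager
    def locked(self):
        with self._lock:
            yield self._obj


shared = Guarded({"count": 0})


def worker(n):
    for _ in range(n):
        with shared.locked() as d:
            # The whole read-modify-write happens under the lock.
            d["count"] += 1


threads = [threading.Thread(target=worker, args=(10_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

with shared.locked() as d:
    total = d["count"]
```

This only helps if every thread goes through `locked()`. A single reference to the object that leaks out of the block defeats the lock.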

Can I read from a pandas object while no-one else is writing to it?

For almost all data structures in Python, the answer is yes. For pandas, no. And it seems it's not a design goal at present.

Typically, you can perform 'reading' operations on objects if no-one is performing mutating operations. You have to be a little cautious though. Some data structures, including pandas, perform memoization to cache expensive operations that are otherwise functionally pure. It's generally easy to implement lockless memoization in Python:

@property
def thing(self):
    # self._thing starts out as the MISSING sentinel and is filled in on first access
    if self._thing is MISSING:
        self._thing = self._calc_thing()
    return self._thing

... it's simple and safe (assuming assignment is safely atomic -- which has not always been the case for every language, but is in CPython, unless you override __setattr__).
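Here is a self-contained version of that pattern that you can run from several threads. The class name `Lazy`, the `calls` counter, and the use of `sum` as the "expensive" calculation are illustrative assumptions, not anything taken from pandas. Note that two threads may both find the cache empty and both run the calculation. That is harmless here only because the calculation is pure, so every thread stores the same value.

```python
import threading

MISSING = object()  # sentinel meaning "not computed yet"


class Lazy:
    def __init__(self, values):
        self._values = tuple(values)
        self._thing = MISSING
        self.calls = 0  # diagnostic only: how many times the calculation ran

    def _calc_thing(self):
        self.calls += 1
        return sum(self._values)  # stands in for an expensive pure calculation

    @property
    def thing(self):
        if self._thing is MISSING:
            # Possibly executed by more than one thread; all store the same value.
            self._thing = self._calc_thing()
        return self._thing


obj = Lazy(range(1000))
results = []


def reader():
    results.append(obj.thing)


threads = [threading.Thread(target=reader) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

If the calculation had side effects, or if the cached value were built up in several steps on `self`, this lockless version would no longer be safe.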

In pandas, Series and DataFrame indexes are computed lazily, on first use. I hope (but I do not see guarantees in the docs) that they're done in a similarly safe way.

For all libraries (including pandas) I would hope that all types of read-only operations (or more specifically, 'functionally pure' operations) would be thread safe if no-one is performing mutating operations. I think this is a reasonable, easily achievable, common lower bar for thread safety.

For pandas, however, you cannot assume this. Even if you can guarantee no-one is performing 'functionally impure' operations on your object (e.g. writing to cells, adding/deleting columns), pandas is not thread safe.

Here's a recent example: https://github.com/pandas-dev/pandas/issues/25870 (it's marked as a duplicate of the .copy-not-threadsafe issue, but it seems it could be a separate issue).

s = pd.Series(...)
f(s)  # Success!

# Thread 1:
   while True: f(s)  

# Thread 2:
   while True: f(s)  # Exception !

... fails for f(s): s.reindex(..., copy=True), which returns its result as a new object -- you would think it would be functionally pure and thread safe. Unfortunately, it is not.
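The two-thread pattern above can be turned into a small harness. This is only a sketch: `run_concurrently` is a hypothetical helper name, and the demo calls it with a pure function on an immutable tuple so that it runs without pandas.

```python
import threading


def run_concurrently(f, arg, n_threads=2, n_iters=1000):
    """Call f(arg) repeatedly from several threads; return the exceptions raised."""
    errors = []

    def loop():
        for _ in range(n_iters):
            try:
                f(arg)
            except Exception as exc:
                errors.append(exc)
                return  # stop this thread at its first failure

    threads = [threading.Thread(target=loop) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return errors


# Demo: a pure function on an immutable object raises nothing.
errors = run_concurrently(sum, (1, 2, 3))
```

To probe the case described here, you would pass the Series as `arg` and something like `lambda s: s.reindex(new_index, copy=True)` as `f`, where `new_index` is a placeholder for whatever index you are reindexing to. Whether any exceptions come back will depend on the pandas version and on thread timing, so an empty list from one run does not show that the call is safe.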

The result of this is that we could not use pandas in production for our healthcare analytics system - and I now discourage it for internal development since it makes in-memory parallelization of read-only operations unsafe. (!!)

The reindex behavior is weird and surprising. If anyone has ideas about why it fails, please answer here: What's the source of thread-unsafety in this usage of pandas.Series.reindex(, copy=True)?

The maintainers marked this as a duplicate of https://github.com/pandas-dev/pandas/issues/2728 . I'm suspicious, but if .copy is the source, then almost all of pandas is not thread safe in any situation (which is their advice).

The data in the underlying ndarrays can be accessed in a threadsafe manner, and modified at your own risk. Deleting data would be difficult as changing the size of a DataFrame usually requires creating a new object. I'd like to change this at some point in the future.
