Destruction of condition variable randomly loses notification

I am pretty sure your vendor’s implementation is broken. Your program looks almost OK from the perspective of obeying the contract with the cv/mutex classes. I couldn’t verify this 100%, because I am one version behind.

The notion of “blocking” is confusing in the condition_variable (CV) class because there are multiple things a thread can be blocked on. The contract requires the implementation to be more complex than a veneer over, for example, pthread_cond_*. My reading of it indicates that a single CV would require at least two pthread_cond_t objects to implement.

The crux is that the destructor has defined behaviour even while threads are still waiting upon a CV (provided they have all been notified), and its ruin is a race between CV.wait and ~CV. The naive implementation simply has ~CV broadcast the condvar and then eliminate it, and has CV.wait remember the lock in a local variable, so that when it awakens from the runtime notion of blocking it no longer has to reference the object. In that implementation, ~CV becomes a “fire and forget” mechanism.

Sadly, a racing CV.wait could meet the preconditions, yet not be finished interacting with the object, when ~CV sneaks in and destroys it. To resolve the race, CV.wait and ~CV need to exclude each other; thus the CV requires at least a private mutex.
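To make the naive design concrete, here is a minimal sketch of it as a veneer over a single pthread_cond_t. NaiveCV and naive_cv_demo are hypothetical names made up for illustration; this is not how any particular standard library implements std::condition_variable, and error handling is omitted. The comment in wait() marks the window described above; the demo itself stays out of that window by joining the waiter before destroying the object.

```cpp
#include <cassert>
#include <pthread.h>
#include <thread>

// Hypothetical "veneer" condition variable: one pthread_cond_t and nothing else.
class NaiveCV {
    pthread_cond_t cond_;
public:
    NaiveCV() { pthread_cond_init(&cond_, nullptr); }
    NaiveCV(const NaiveCV&) = delete;
    NaiveCV& operator=(const NaiveCV&) = delete;

    // Caller must hold *user_mutex. The mutex pointer is a local, so once
    // pthread_cond_wait returns, this function no longer touches *this.
    void wait(pthread_mutex_t* user_mutex) {
        // Window for the race: wait() has been entered, but the thread is not
        // yet blocked on cond_. Nothing stops a concurrent ~NaiveCV() here.
        pthread_cond_wait(&cond_, user_mutex);
    }

    void notify_all() { pthread_cond_broadcast(&cond_); }

    // "Fire and forget": wake everyone, then tear down without checking
    // whether some thread is still on its way into or out of wait().
    ~NaiveCV() {
        pthread_cond_broadcast(&cond_);
        pthread_cond_destroy(&cond_);
    }
};

// Deterministic use that avoids the race: join the waiter, then destroy.
bool naive_cv_demo() {
    pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
    bool ready = false;
    bool woke = false;
    NaiveCV* cv = new NaiveCV;

    std::thread waiter([&] {
        pthread_mutex_lock(&m);
        while (!ready) cv->wait(&m);  // loop guards against spurious wakeups
        woke = true;
        pthread_mutex_unlock(&m);
    });

    pthread_mutex_lock(&m);
    ready = true;
    pthread_mutex_unlock(&m);
    cv->notify_all();

    waiter.join();  // the waiter is done with *cv
    delete cv;      // so destruction cannot race with wait()
    assert(woke);
    return woke;
}
```

Swapping the last two steps (delete before join) is exactly the case where this sketch offers no protection.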

We aren’t finished yet. There usually isn’t underlying support (e.g. in the kernel) for an operation like “wait on cv controlled by lock and release this other lock once I am blocked”. I think that even the POSIX folks found that too funny to require. Thus, burying a mutex in my CV isn’t enough; I actually require a mechanism that permits me to process events within it, so a private condvar is required inside the implementation of CV. Obligatory David Parnas meme.
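One way to picture the private mutex plus private condvar arrangement is the sketch below. It is an assumption-laden illustration, not the design of any real standard library: DrainingCV, its members, and draining_cv_demo are hypothetical names, only notify_all is provided, exception safety is ignored, and it assumes the platform lets a mutex be destroyed as soon as its last holder has released it (POSIX requires this of pthread mutexes). The destructor keeps the standard's precondition that every waiter has already been notified, and then blocks until those waiters have actually left wait().

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>

class DrainingCV {
    std::mutex internal_;              // private mutex: orders wait() against the destructor
    std::condition_variable wake_;     // waiters block here
    std::condition_variable drained_;  // the destructor blocks here
    unsigned long generation_ = 0;     // bumped by every notify_all()
    int inside_ = 0;                   // threads currently inside wait()
public:
    // Caller must hold user_lock, as with std::condition_variable::wait.
    void wait(std::unique_lock<std::mutex>& user_lock) {
        std::unique_lock<std::mutex> in(internal_);
        ++inside_;                       // register before releasing the user lock
        const unsigned long seen = generation_;
        user_lock.unlock();
        wake_.wait(in, [&] { return generation_ != seen; });
        if (--inside_ == 0) drained_.notify_all();
        in.unlock();                     // last access to *this
        user_lock.lock();
    }

    void notify_all() {
        std::lock_guard<std::mutex> in(internal_);
        ++generation_;
        wake_.notify_all();
    }

    // Precondition: every waiter has been notified. Blocks until they have left.
    ~DrainingCV() {
        std::unique_lock<std::mutex> in(internal_);
        drained_.wait(in, [&] { return inside_ == 0; });
    }
};

// Destroys the CV right after notifying, without joining the waiter first.
bool draining_cv_demo() {
    std::mutex m;
    bool ready = false;
    bool woke = false;
    DrainingCV* cv = new DrainingCV;

    std::thread waiter([&m, &ready, &woke, cv] {
        std::unique_lock<std::mutex> lk(m);
        while (!ready) cv->wait(lk);
        woke = true;                     // cv is not touched after the loop
    });

    {
        std::lock_guard<std::mutex> lk(m);
        ready = true;
    }
    cv->notify_all();
    delete cv;                           // waits for a waiter still inside wait()
    waiter.join();
    assert(woke);
    return woke;
}
```

The waiter registers itself under internal_ before it releases the user's lock, so by the time the notifying thread can set ready, the destructor is guaranteed to see the waiter in inside_ and wait for it.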

Almost OK, because, as Marek R points out, you are relying on referencing a class after its destruction has begun: not the cv/mutex class, but your notify_on_delete class. The conflict is a bit academic. I doubt clang would depend upon nod remaining valid after control had transferred to nod->cv.wait(); but the real customers of most compiler vendors are benchmarks, not programmers.

As a general note, multi-threaded programming is difficult, and having now peeked at the C++ threading model, it might be best to give it a decade or two to settle down. Its contracts are astonishing. When I first looked at your program, I thought ‘duh, there is no way you can destroy a cv that can be accessed because RAII’. Silly me.

Pthreads is another awful API for threading. At least it doesn’t attempt over-reach, and is mature enough that robust test suites keep vendors in line.


When NOTIFY_IN_DESTRUCTOR is defined:
Calling notify_one()/notify_all() doesn't mean that the waiting thread is woken up immediately, nor that the current thread waits for it. It just means that if the waiting thread wakes up at some point after the current thread has called notify, it should proceed. So, in essence, you might be deleting the condition variable before the waiting thread wakes up (depending on how the threads are scheduled).

The explanation for why it hangs, even if the condition variable is deleted while the other thread is waiting on it, lies in the fact that the wait/notify operations are implemented using queues associated with the condition variables. These queues hold the threads waiting on the condition variables. Freeing the condition variable would mean getting rid of these thread queues.
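
Since notify returns without waiting for the other thread, the conservative fix is to make sure the waiter is finished with the condition variable before deleting it. The sketch below does that by joining the waiting thread first. Signal and notify_then_join are hypothetical names used only for illustration, not the code from the question.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>

// Hypothetical bundle of the shared state; it lives on the heap so that
// its destruction point is explicit.
struct Signal {
    std::mutex m;
    std::condition_variable cv;
    bool ready = false;
};

bool notify_then_join() {
    bool woke = false;
    Signal* s = new Signal;

    std::thread waiter([s, &woke] {
        std::unique_lock<std::mutex> lk(s->m);
        s->cv.wait(lk, [s] { return s->ready; });  // predicate handles spurious wakeups
        woke = true;
    });

    {
        std::lock_guard<std::mutex> lk(s->m);
        s->ready = true;
    }
    s->cv.notify_one();  // returns at once; the waiter may not have run yet

    waiter.join();       // the waiter has left wait() and released the mutex
    delete s;            // only now is nobody using the cv or the mutex
    assert(woke);
    return woke;
}
```

Joining is the bluntest option. Any other handshake that proves the waiter has returned from wait() and released the mutex would serve the same purpose.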