Impossible race condition between threads?

theagentd · October 25, 2013, 2:07pm

Hello everyone!

I recently encountered an extremely rare race condition in my threading code. After 4 hours of debugging I finally managed to reproduce the error somewhat reliably and managed to track it down and fix it, but I still can’t wrap my head around WHY it happened in the first place.

The class in question is in charge of distributing tasks to worker threads. Since the total number of tasks was known, I simply kept an AtomicInteger counter to keep track of the number of completed tasks to be able to signal the main thread that all tasks had been completed. The problem was that the totalTasks field did not get synchronized, meaning that the value the atomic task counter was compared to was sometimes not updated. How can that possibly happen in the following code?



    private PriorityBlockingQueue<Task> taskQueue;

    protected AtomicInteger taskCounter;
    protected int totalTasks;

    public void run(TaskTree tree){ //A TaskTree contains tasks...
        taskCounter.set(0);
        totalTasks = tree.getNumTasks(); //This call here was sometimes not visible to the worker threads.
        for(Task t : tree.getTasks()){
            taskQueue.add(t); //This will cause the thread below to start processing.
        }
        //wait for completion signal that from worker thread(s)
    }

    //The worker thread run() method:
    public void run(){
        Task task;
            while (running) {
                try {
                    task = taskQueue.take();
                } catch (InterruptedException ex) {
                    break; //Closed, kill thread
                }
                
                task.process();
                if (taskCounter.incrementAndGet() == totalTasks) { //totalTasks sometimes had not been updated with the new value yet!
                    //Signal main thread that all tasks have been completed.
                }
            }
        }
    }

The code above is extremely simplified, so the taskCounter variable may seem redundant, etc. Anyway, the race condition only started occurring when I first ran a TaskTree with a single task (totalTasks = 1), and then ran a different TaskTree with 10 tasks. When the second TaskTree was run, a worker thread would sometimes process a single task, see that (taskCounter = totalTasks = 1) and signal that all tasks had been completed.

I “solved” this by counting backwards. By setting the taskCounter to totalTasks, decrementing instead of incrementing and comparing to 0 the race condition was avoided, but I still can’t fathom why the locking done by the task queue does not cause proper synchronization between the threads. How can the thread be woken up from awaiting a lock and then read the wrong value? I was under the impression that locking a lock would cause all memory to be synchronized properly each time a thread acquires that lock.

Thanks for reading!

65K · October 25, 2013, 2:19pm

Just from a quick glance:
Make totalTasks volatile or synchronize it.

Riven · October 25, 2013, 2:27pm

If you want help with code that is problematic don’t give us pseudo code :

The locking code, which is critical, is just… gone.

Just like with synchronized, only the variables in the critical section are guaranteed to be synced among threads. After a thread unlocks a lock, no more guarantees can be made. Even for variables you just accessed in the critical section. Their value is undefined.


synchronized(mutex) {
    // all access to shared variables is safe
}
// all bets are off again (!)

lock.lock();
try {
    // all access to shared variables is safe
}
finally {
   lock.unlock();
}
// all bets are off again (!)

Last but not least, AtomicInteger.set(n) is rather awkward in multi threaded code. What if multiple threads call .run(TaskTree) ? Just call AtomicInteger.incrementAndGet() prior to adding each task to the queue.

Anyway, I’d advise to use CountdownLatch. Your solution to count backwards is what I always did, and is guaranteed to work, but it’s what CountdownLatch does behind the scenes, including signalling the other thread(s) when the latch reaches zero. I however always make a new CountdownLatch for each run(TaskTree) operation, instead of stuffing it in a field, and simply attach the CountdownLatch to each task (can be a done with a holder object), and use latch.await() in the .run(TaskTree) method.

theagentd · October 25, 2013, 4:31pm

I didn’t remove anything important. The locking happens in PriorityBlockingQueue.

run(TaskTree) is not meant to be thread safe, but I’ll keep that in mind.

That seems like a good idea, but a problem is that the main thread actually also processes tasks. Since the OpenGL context belongs to the main thread all tasks that use OpenGL have to be run on that thread.

Riven:

Just like with synchronized, only the variables in the critical section are guaranteed to be synced among threads. After a thread unlocks a lock, no more guarantees can be made. Even for variables you just accessed in the critical section. Their value is undefined.
synchronized(mutex) {
    // all access to shared variables is safe
}
// all bets are off again (!)
lock.lock();
try {
    // all access to shared variables is safe
}
finally {
   lock.unlock();
}
// all bets are off again (!)

Now that is terrible. Once a task is completed it can unlock one or more new task which are then added to the queue. I was under the assumption that the synchronization in PriorityBlockingQueue would cover all memory, but you’re telling me that there’s a risk that if a task unlocks a new task which is picked up by a different thread than the first, there’s a risk that it may not see the changes made by the first task? I think I need to sit down for a while…

Riven · October 25, 2013, 4:38pm

Keep in mind that the CPU (!) and the JVM are allowed to reorder instructions, so whatever happens in-order in your source code, and bytecode, that doesn’t mean shit for actual behaviour. Only critical sections provide happens-before guarantees on the boundaries.

Right after you left the critical section, another thread/core/whatever may overwrite that field again with out of date data.

theagentd · October 25, 2013, 4:41pm

How would I guarantee that a task is run after its prerequisites then? I suppose what I’m looking for is a kind of memory barrier/fence… Or am I…? ???

That cannot happen in my case since once the critical section is left the thread won’t modify that data anymore. Or did I misunderstand something again…? T___T

Riven · October 25, 2013, 4:50pm

IIRC, back when you introduced your thread library on JGO, I browsed the code and thought it could never be guaranteed to work. I decided to keep quiet, as you seemed so sure of yourself.

With your current code I also don’t know how you could possibly handle tasks that are added to the queue while processing the queue, as they also increment/decrement your AtomicInteger, which will cause your blocking .run(TaskTree) to return way before all appended tasks have finished. It’s all very tricky, and the code you posted doesn’t take it into account at all…?

Riven · October 25, 2013, 4:56pm

Communication between threads/cores can happen at literally any time, which means stale data is ‘synchronized’, which means thread A and thread B can both know the field holds value 0 (at a specific point in time), but at one point (a bit earlier) thread A assigned it another value, and that value is only now being propagated to thread B. Causing thread B to see a stale value, outside of the critical section.

theagentd · October 25, 2013, 4:58pm

I’ll be damned… ._.

The total number of tasks in a task tree is known from the start. Before the worker threads are given any work through the task queue I call [icode]remainingTasks.set(tree.getNumTasks());[/icode]. The atomic counter is not incremented as tasks are added to the queue. The atomic counter can never reach zero unless all tasks are finished.

And there is no easy way to force a task to have a happens-before relationship with other tasks?

Riven · October 25, 2013, 5:00pm

Nothing is easy when dealing with concurrency. Immutable objects and message queues help.

As a dirty hack, you can give the ‘dependent task’ a lock that is used by the ‘parent task’, when modifying data. This shouldn’t cause any blocking, unless a task has multiple ‘children’.

theagentd · October 25, 2013, 5:09pm

Hmm…

So if all tasks synchronize all their prerequisites and on themselves they would be guaranteed to see the data of its required tasks. I still have a hard time accepting that that is necessary…

Riven · October 25, 2013, 5:13pm

You’d only have to lock on the parent (if any) and itself.

It’s a bit like:


synchronized(parent) {
   synchronized(this) {
       //
   }
}

but then preventing deadlocks by using Lock.tryLock to either acquire both locks or none.

Having said that, immutable objects and message queues are preferable.

theagentd · October 25, 2013, 5:51pm

Sorry, had dinner…

Riven:

You’d only have to lock on the parent (if any) and itself.

It’s a bit like:
synchronized(parent) {
   synchronized(this) {
       //
   }
}
but then preventing deadlocks by using Lock.tryLock to either acquire both locks or none.

Yes, exactly. But don’t I also have to make sure that all memory transactions that were done before calling run(TaskTree) have completed? And don’t I have to ensure that all code that comes after can see the changes made by the tasks? This is ridiculous. I’d have to put the whole goddamn game under one big lock. -_-

Riven · October 25, 2013, 5:56pm

Keep in mind CountDownLatch has happens-before guarantees, as long as all threads modifying data call coutdown(). But only the thread calling await() has this consistent view on the data modified by the worker threads.

http://docs.oracle.com/javase/6/docs/api/java/util/concurrent/package-summary.html#MemoryVisibility

theagentd · October 25, 2013, 6:02pm

http://docs.oracle.com/javase/7/docs/api/java/util/concurrent/package-summary.html#MemoryVisibility

[quote]Actions in a thread prior to placing an object into any concurrent collection happen-before actions subsequent to the access or removal of that element from the collection in another thread.
[/quote]
When a task is completed, the tasks it unlocks are inserted into the concurrent collection. When they are removed from the collection, they are guaranteed to happen “after” the unlocking task(s) have completed regardless of locking. W-what’s going on…? Once a task is taken out, it has to have a happens-after relationship with the tasks that happened before it thanks to this.

EDIT: To clarify, the problem still isn’t solved. The same logic as the one above implies that the race condition I got should not be possible either.

Roquen · October 26, 2013, 1:10pm

Locks suck. Deadlocks, races, priority inversion…meh.

I don’t understand why a worker thread is informing the main thread of anything. Why do it that way?

(EDIT: BTW - I did see the “might seem redundant” part…what are you really attempting to do with this is my question).

theagentd · October 26, 2013, 7:40pm

run(TaskTree) blocks until all tasks in the tree have been completed. That’s why the worker threads have to be able to signal this.

theagentd · October 26, 2013, 8:38pm

My attempts at reproducing the original error failed today. However, I believe I have found another problem with the assumptions I made when I wrote this concerning how certain tasks are executed. From what I know this problem should not be able to cause the original race condition, but it should be able to cause race conditions when a task requires several other tasks to complete before they’re allowed to be executed.

Basically each task has a counter of how many of its prerequisites have been completed. It’s an atomic int. When a required task is completed it increments this counter by 1 for all tasks that require the just completed task. Since only the last task interacts with the task queue (according to Oracle docs that guarantees correct order, see my above post), only the changes done by the task that finally unlocks the new task and adds it to the queue will be guaranteed to be visible to the new task. The other required tasks simply incremented an atomic counter, so some the unlocked task may not be able to see those changes. (I have never had any problems, but that’s a bad argument here…)

Applying this to the original problem

Due to new tasks being added to the queue being added BEFORE incrementing the taskCounter, it’s entirely possible for any task to be the last that “finishes” if the thread is not allowed to run anymore for some reason immediately after having queued up new tasks, so all tasks potentially need to be able to see what the main thread does.

The “root” tasks (the tasks available from the beginning) are added by the main thread, so the main thread should have a happens-before relationship with all the root tasks according to the Oracle docs. Assuming there is one more task that requires say 3 of the root tasks to complete before it can run, it’ll then be unlocked by the same thread that completes the last required task. That means that the thread that runs the new task is guaranteed to be able to see the work done by the unlocking thread (1 of the 3 required tasks), but not the other two required tasks (the first two simply incremented the unlocking counter). Although this is bad, it still means that it’ll always happen-after one of the tasks that happened-after the main thread.

It all becomes a mathematical proof that it should work correctly:

The root tasks happen-after the main thread.
If a task happens-after a task that happens after the main thread, it too happens after the main thread.
An unlocked task will happen-after ONE of its required task.
Conclusion: All tasks happen-after the main thread.

So, either my reasoning above is wrong, or my code does not work as I am thinking… Any ideas?

EDIT: I also don’t know why this new problem isn’t happening in practice. For example, I have tasks that are run X times (doing different things each run of course) and feature a finish() function that I use for merging the results.

Example:


AtomicInteger subtaskID = new AtomicInteger(0); //ID of each subtask to make them process different parts of a task.
AtomicInteger completedSubtasks = new AtomicInteger(0); //How many subtasks that have completed
int numSubtasks = 4; //How many subtasks the task is divided into
...

task.run(subtaskID.getAndIncrement());

if(completedSubtasks.incrementAndGet() == numSubtasks){
    //All subtasks have been completed, but there's no synchronization between the threads
    //that processed the different subtasks.

    task.mergeResultsFromSubtasks(); //This call accesses data modified by other threads and should result in race conditions a lot, but it never does.
}

I assume that updating the atomic counter in some way causes the subtask data to be synchronized as well, but that it’s just a fluke due to my use cases and/or how my processor works.

theagentd · October 27, 2013, 6:25pm

Triple posting, but I wanted to “close” this thread. :clue:

Since I couldn’t guarantee that my multithreading was thread-safe, I’ve decided to abandon my old multithreader. I’ve made a new one that has a bit more synchronization overhead but it should be 100% thread safe. Running 48 tasks with complex dependencies is now around 0.05 to 0.1ms slower, which is acceptable. The new multithreading system relies on a master thread that coordinates the worker threads and the draw thread. I no longer do any synchronization with atomic ints; I’m completely relying on concurrent collections as specified in the Oracle docs, and that the master thread is the only thread that can coordinate task (the old one allowed worker threads to add tasks). The old multithreader is still there, but is marked as experimental until I can prove or disprove that it’s thread safe.

Roquen · October 30, 2013, 9:46pm

I happened to still have this open and reskimmed. If the first “snippet” is representative then the problem is that the variable reference can legally be promoted to local copy and thus the thread won’t see changes (unless at some point the local version is updated). This is one of those trick things to notice because slight and potentially unrelated code change can make the compiler perform the promotion or not.

Change to volatile (as 65K mentioned…syncronization won’t help and there is no “race”) or an atomic. The latter more for code clarity than any practical purpose.

Also talking about “happens before” when discussing atomic operations is making the discussion too complicated.