Hello everyone!
I recently encountered an extremely rare race condition in my threading code. After 4 hours of debugging I finally managed to reproduce the error somewhat reliably and managed to track it down and fix it, but I still can’t wrap my head around WHY it happened in the first place.
The class in question is in charge of distributing tasks to worker threads. Since the total number of tasks was known, I simply kept an AtomicInteger counter to keep track of the number of completed tasks to be able to signal the main thread that all tasks had been completed. The problem was that the totalTasks field did not get synchronized, meaning that the value the atomic task counter was compared to was sometimes not updated. How can that possibly happen in the following code?
private PriorityBlockingQueue<Task> taskQueue;
protected AtomicInteger taskCounter;
protected int totalTasks;
public void run(TaskTree tree){ //A TaskTree contains tasks...
taskCounter.set(0);
totalTasks = tree.getNumTasks(); //This call here was sometimes not visible to the worker threads.
for(Task t : tree.getTasks()){
taskQueue.add(t); //This will cause the thread below to start processing.
}
//wait for completion signal that from worker thread(s)
}
//The worker thread run() method:
public void run(){
Task task;
while (running) {
try {
task = taskQueue.take();
} catch (InterruptedException ex) {
break; //Closed, kill thread
}
task.process();
if (taskCounter.incrementAndGet() == totalTasks) { //totalTasks sometimes had not been updated with the new value yet!
//Signal main thread that all tasks have been completed.
}
}
}
}
The code above is extremely simplified, so the taskCounter variable may seem redundant, etc. Anyway, the race condition only started occurring when I first ran a TaskTree with a single task (totalTasks = 1), and then ran a different TaskTree with 10 tasks. When the second TaskTree was run, a worker thread would sometimes process a single task, see that (taskCounter = totalTasks = 1) and signal that all tasks had been completed.
I “solved” this by counting backwards. By setting the taskCounter to totalTasks, decrementing instead of incrementing and comparing to 0 the race condition was avoided, but I still can’t fathom why the locking done by the task queue does not cause proper synchronization between the threads. How can the thread be woken up from awaiting a lock and then read the wrong value? I was under the impression that locking a lock would cause all memory to be synchronized properly each time a thread acquires that lock.
Thanks for reading!