Impossible race condition between threads?

So the fact that it was happening randomly was due to the JIT compiler? What exactly does “promoted to local copy” mean? The variable being assigned is a class variable… Thanks for your response!

JIT = yes. The JIT will treat a thread as being the only one running. So if it sees that a variable cannot be modified in a method or loop (including deductions it may be able to make about all possible sub-methods that are called) it can pull that variable into a local (stack) variable or a register. Specifying volatile tells it that the variable must be read from/written to main memory at each access.
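
A quick sketch of what that promotion can look like (hypothetical class; the JIT is allowed, not required, to do this):

class Worker implements Runnable {
    boolean running = true; // not volatile

    public void run() {
        // The JIT may read 'running' once into a register and then spin on
        // that register forever, never seeing another thread's write.
        while (running) {
            // ... do work ...
        }
    }

    void shutdown() {
        running = false; // without volatile, may never become visible to run()
    }
}

Declaring the field as volatile boolean running forces a fresh read from main memory on every iteration.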

Makes sense, but I don’t really understand what criteria the compiler is using to determine if it should promote a variable. My Google-fu gave me no relevant results…

It does not seem to take into account that the worker thread run() method was actually in an inner class that extended Thread. It should be obvious that the variable is used from multiple threads, but I guess it’s not surprising that it didn’t look that far. Besides, it’s entirely possible to manually call the Thread’s run() method without starting a new thread… murr

It’s too much work for the compiler to deduce (pretty often impossible)…every variable would have to be provably never modified by some other thread (or interrupt, etc.). Every variable that couldn’t be proven so would be equivalent to volatile (that would be bad). Much better to have the programmer say: “Yo someone is reading or writing to this behind your back.” and just declare volatiles.

This will be compiler specific. An example: the value is loaded into a register to be used, and when that register is needed for something else it’ll spill (be copied onto the stack); when needed again the stack copy will be reread (again, the compiler has deduced that the main-memory version hasn’t changed, so no reread from main memory is needed). That’s an example of how changing something that looks like it should have no effect can make the bug appear and disappear.

Typed out quickly…hope y’all can follow me.

Roquen:
You mentioned that modifying a variable in a loop or through a method would inherently make the variable volatile. Is that always the case?

If I can’t make any assumptions about what the compiler can do to break the synchronization from my concurrent collections, don’t I have to make everything volatile anyway? For example, let’s say I want to pass a message object to a different thread. This message object is reused by a single thread (in a safe way, of course). When the thread wants to send a message, it sets the variables of the message object and sends it to another thread. Don’t all variables in that class need to be volatile in this case? This example can be extended to almost all parts of my threading library…

Feeling you’re reading that backwards! :wink:

Assuming you’re using the built-in concurrent collections (or properly designed ones), then no. See the section on memory consistency (as Riven linked to earlier), in particular around concurrent collections. The JIT compiler can’t break things that are guaranteed by the JLS to be happens-before.

http://docs.oracle.com/javase/7/docs/api/java/util/concurrent/package-summary.html#MemoryVisibility

But that’s why I wrote “impossible” in the thread title. I wrote to the variable and added an object to a PriorityBlockingQueue, but after taking out the object from the queue the variable had not been updated… Does the variable have to be inside the object passed into the queue? Such a requirement isn’t mentioned in that link.

The problem here is rather obvious and is what Roquen said in a previous post; the totalTasks value is read only once, instead of on every iteration of the while loop, and cached in a register. This is a perfectly valid transformation for the processing thread to do and it has nothing to do with synchronization or happens-before relationships. As far as it is concerned, it is the only thread doing anything with it. The fix is also easy, you just mark it volatile and it will force the processing thread to always read the value from memory. The memory reads enable all the happens-before constraints to become applicable (the PBQ alone is enough to provide the semantics you need) and you should always read the correct value.
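
The pattern being described would look roughly like this (a sketch; the actual code isn’t posted in this thread, so the names are hypothetical):

import java.util.concurrent.PriorityBlockingQueue;

class TaskProcessor {
    volatile int totalTasks; // the fix: without volatile, the loop below may
                             // compare against a stale register copy forever

    // assumes the tasks are Comparable (or the queue was given a Comparator)
    void processLoop(PriorityBlockingQueue<Runnable> queue) throws InterruptedException {
        int completed = 0;
        while (completed < totalTasks) { // volatile forces a fresh read here
            Runnable task = queue.take(); // take() alone doesn't stop the caching
            task.run();
            completed++;
        }
    }
}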

In general, if you ever access the same memory location from multiple threads (and at least one access is a write), you must use volatile or some other means of synchronization. In most cases though, shared memory access exposes you to all kinds of trouble and immutable message passing is almost always a better solution.
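
For example, a minimal sketch of immutable message passing (the Message and Mailbox types here are hypothetical):

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

final class Message {
    final int type;       // all fields final: the object is safely published
    final String payload; // through the queue; no volatile needed anywhere

    Message(int type, String payload) {
        this.type = type;
        this.payload = payload;
    }
}

class Mailbox {
    final BlockingQueue<Message> queue = new ArrayBlockingQueue<>(64);

    void send(int type, String payload) throws InterruptedException {
        queue.put(new Message(type, payload)); // a fresh object every time
    }

    Message receive() throws InterruptedException {
        return queue.take(); // guaranteed to see a fully constructed Message
    }
}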

[quote=“theagentd,post:23,topic:44753”]
This should be obvious from what I wrote above, but there’s no “analysis” going on. There’s no difference between “normal” code and code running in a Thread/run(), it’s all the same thing and the same optimizations apply. The only relevant information for the JIT compiler is whether the field is volatile or not.

[quote=“theagentd,post:25,topic:44753”]
No. The loop may be optimized by reading the field once, doing all mutations in a register, then writing out the result once at the end of the loop. Even if the JITed code always reads/writes from/to memory, it would still be different from volatile. Volatile provides very specific semantics and is implemented with special CPU instructions. A write to a volatile field is actually a write followed by a write-release barrier. A read of a volatile field is a read-acquire barrier followed by a read. The write-release barrier synchronizes with (i.e. happens-before) the read-acquire barrier. There’s nothing like that going on for non-volatile field accesses, optimized or not.
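
A sketch of what those barriers buy you (hypothetical fields):

class Publisher {
    int data;               // plain field
    volatile boolean ready; // volatile flag

    void publish() {        // writer thread
        data = 42;          // (1) plain write
        ready = true;       // (2) volatile write = write + write-release barrier
    }

    void consume() {        // reader thread
        if (ready) {                  // (3) read-acquire barrier + read
            System.out.println(data); // (4) guaranteed to see 42:
        }                             //     (1) happens-before (4)
    }
}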

??? Actually shouldn’t the take() in every iteration mean that totalTasks is not cached?

From the link above:

[quote]Actions in a thread prior to placing an object into any concurrent collection happen-before actions subsequent to the access or removal of that element from the collection in another thread.
[/quote]
@theagentd - btw, I’m wondering why you’re trying to build this without making use of other concurrent utilities within the JDK? Couldn’t you look at the standard executor service, and use futures to model task dependencies? Or fork / join?

[quote=“nsigma,post:29,topic:44753”]
The way I understand it, the read indeed happens after the first take() and is then cached. It explains this:

[quote=“theagentd,post:1,topic:44753”]
i.e. it works correctly only the first time through the loop, then the optimization kicks in.

I may be misunderstanding something here, but why don’t you use futures? Create a thread pool with e.g. Executors.newFixedThreadPool() and then submit some anonymous objects of type Callable to it. That gives you a list of Future objects, and then you simply call get() on all of them to block until all tasks are finished.
That way you don’t need to synchronize anything and you don’t need volatile either. As long as the data you pass into your Callables is immutable and your Callables are side-effect free, you are guaranteed to never run into any kind of threading problem.
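
A minimal sketch of that approach (the task body here is a stand-in):

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.*;

class FutureExample {
    static List<Integer> runAll() throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(8);
        List<Future<Integer>> futures = new ArrayList<>();

        for (int i = 0; i < 100; i++) {
            final int taskId = i; // immutable input, captured by the Callable
            futures.add(pool.submit(new Callable<Integer>() {
                public Integer call() {
                    return taskId * 2; // side-effect-free work
                }
            }));
        }

        List<Integer> results = new ArrayList<>();
        for (Future<Integer> f : futures) {
            results.add(f.get()); // blocks until that task has finished
        }
        pool.shutdown();
        return results;
    }
}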

Some notes: this whole “happens before” and related formal speak…don’t bother with it. You need it if you’re going to write a compliant VM or if you need to have a formal conversation. For actually writing your average to not-so-average multi-threaded code it just needlessly complicates everything and will be more confusing than helpful. Atomics and volatiles are straightforward (and you’re not going to run out and make your own custom concurrent data-structures 'cause you’re too smart to do that).

On volatiles. Probably most people should completely ignore their existence…at least until your feet are wet with multithreading and you have a basic low-level understanding of how a program is translated into machine code. You don’t really need them, as you can use the more flexible atomic wrappers instead to do the same thing and much more. They (volatiles) are pretty limited in usage without very careful thought, and that extra thought isn’t worth the extra memory usage the atomic wrappers will cost (heck, it can even be a good thing by lowering the chance of false sharing). And for those that care…performance-wise atomic wrappers and volatiles are about the same…on Intel-alikes they both use the ‘lock’ prefix on the instructions (this locks the memory bus for the duration of the op’s execution and flushes all pending memory operations).

Roughly speaking, marking a field volatile says:

  1. Every time I specify a read to this field: you must read it (and not use a local copy, register or otherwise)
  2. Every time I specify a write to this field: you must write it…even if it seems pointless or redundant
  3. Don’t reorder reads/writes of this field with other accesses, and make sure memory is up to date before doing the read or write.

If this isn’t perfectly clear: Use the atomic wrappers (just to repeat myself).
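
One concrete trap those three rules don’t cover (a sketch): volatile gives you visibility, not atomicity of compound operations; that’s exactly where the wrappers save you.

import java.util.concurrent.atomic.AtomicInteger;

class Counters {
    volatile int broken;                               // visible, but...
    final AtomicInteger correct = new AtomicInteger();

    void increment() {
        broken++;                  // read-modify-write: two threads can both
                                   // read 5 and both write 6, losing an update
        correct.incrementAndGet(); // one atomic operation, no lost updates
    }
}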

Roquen, I understand your PoV, given the intended audience, but I’ll have to slightly disagree.

  • The atomic classes are nothing more than wrappers around a volatile field, plus CAS-based utility methods.
  • The most common use of atomic classes is to concurrently count or sum values. For those usages, atomic classes have been superseded by LongAdder/LongAccumulator and DoubleAdder/DoubleAccumulator in JDK 8, which have much higher throughput under high contention.
  • Then it’s usually (ab)using atomic classes for concurrency control. CyclicBarrier, CountDownLatch and Phaser are much better options here.

There’s not much utility left for atomics, except for people writing libraries. I think a novice programmer would be better off sticking with volatile; all those methods with the funny names in the atomic classes would just be confusing. The real trouble is getting familiar with the java.util.concurrent classes and understanding when and how to use them, though. That should always be the first option.
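
For the counting case mentioned above, a minimal JDK 8 sketch (hypothetical names):

import java.util.concurrent.atomic.LongAdder;

class TaskStats {
    final LongAdder tasksCompleted = new LongAdder();

    void onTaskDone() {
        tasksCompleted.increment(); // contended increments spread over cells
    }

    long total() {
        return tasksCompleted.sum(); // call once the workers have quiesced
    }
}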

For those wanting to learn more, these 3 posts are interesting: 1st, 2nd, 3rd. Not Java specific, but it’s better to understand what happens at a lower level.

take() enters a ReentrantLock, which has the same semantics as a read/write on a volatile, as does add() - hence meaning that totalTasks should be uncached on every iteration, yes?

This is assuming calls to add() and take() are properly paired up and there is no way the totalTasks value can be changed during the time a task is being processed. Just saying I think something else is the cause of this issue, not in any way saying that I think this code is a good idea! :wink:

That’s what I said earlier, though it’s complicated by the need for task dependencies. You could possibly look at tasks blocking on the Futures of subtasks; assuming tasks are added in depth-first order, this might work without creating too many threads? The other thing I’d look at is using two queues - ie. a queue for completed worker tasks passed back to the main thread, as sketched below - this way the actual task counting / management is single-threaded.
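
Something like this shape (a sketch; the Task type and names are stand-ins):

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

class TwoQueueScheduler {
    interface Task { void process(); }

    final BlockingQueue<Task> pending = new LinkedBlockingQueue<>();
    final BlockingQueue<Task> completed = new LinkedBlockingQueue<>();

    void workerLoop() throws InterruptedException { // worker threads
        while (true) {
            Task t = pending.take();
            t.process();
            completed.put(t); // report back instead of touching shared counters
        }
    }

    void awaitAll(int submitted) throws InterruptedException { // main thread
        for (int done = 0; done < submitted; done++) {
            Task t = completed.take();
            // release t's dependents into 'pending' here: all the counting
            // and dependency bookkeeping stays single-threaded
        }
    }
}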

Yes, personally I think the novice, and usually everyone else, is better off working with higher-level constructs such as producer-consumer queues or executors from java.util.concurrent. No shared state ftw! ;D

Everyone, thank you so much for your answers! It took me a while to absorb everything. I won’t answer everything, but I did indeed read it all.

The problem is, like nsigma wrote, the complicated dependencies between the tasks.

@Riven’s first post
From http://preshing.com/20130823/the-synchronizes-with-relation/:

[quote]An atomic operation A that performs a release operation on an atomic object M synchronizes with an atomic operation B that performs an acquire operation on M and takes its value from any side effect in the release sequence headed by A.
[/quote]
That would explain why my original multithreading code was working and should also prove that it should always work, assuming that what’s written in that link also applies to Java.

[quote=“Spasi,post:30,topic:44753”]

[quote=“nsigma,post:29,topic:44753”]
The way I understand it, the read indeed happens after the first take() and is then cached. It explains this:

This makes sense and fits very well with what I saw. I usually ran the game single-threaded for a short time by simply calling task.process() on all tasks from the main thread. Then I’d press a button to run the game multithreaded and it’d explode. Sometimes it’d “just” corrupt the task queue by finishing early, effectively corrupting the whole executor and task tree. Rarely it would also freeze, which could be explained by the executor waiting for 10 tasks to finish when there was only 1 task to run.

I’m a bit worried about this though. What could it possibly have been in that case? >_< If add()s and take()s aren’t properly causing synchronization between the threads, wouldn’t that break their rules? If we imagine a computer with an infinite (or sufficiently large) number of registers, wouldn’t that mean that literally any variable not marked as volatile would be cacheable by the compiler, if it always assumes that only a single thread can modify each variable?

Here’s a live example from my engine (which works, of course). I do frustum culling on all 3D models for each perspective (camera + shadow maps) to determine which models need their bone animation skeletons updated.

Task 1-8: Loop through all models, setting boolean needsUpdate = false for each model. Each task processes 1/8th of all models. (It also does other things, like updating bounding boxes and stuff.)
Task 9-16: Go through all perspectives, checking which models are visible to which perspectives. Once a model is visible to a perspective, I set needsUpdate = true for that model. Each task handles a subset of the perspectives.
Task 17: This task simply loops over all models, counting them and adding them to a list. This list then holds all models that need to have their skeletons updated.
Task 18-25: Loop over all models that need to be updated, computing the interpolated bones and generating matrices for each skeleton.
Task 26 (run on OpenGL thread): Maps a VBO, stores all matrices in it and unmaps it again. (Runs in parallel to terrain culling.)

If the compiler is legally allowed to ignore synchronization by storing variables in registers, wouldn’t that effectively mean that the compiler could pull all model.needsUpdate variables into registers on the first run? If it can’t do that since it’s a loop over multiple models, what if there’s only one model? The resetting to false done by the first 8 tasks wouldn’t be visible to the subsequent tasks. The writes that set needsUpdate to true wouldn’t be visible either, and the counting task would also fail since it wouldn’t know which ones had been set to true.

Isn’t this what we can call a compiler or VM bug? ???

EDIT: I ran out of medals to hand out. More to come when I get new ones!

You’re basically implementing a CyclicBarrier.

It’s much more efficient (and convenient) to push all models that are ‘in view’ into a queue, which can be popped by other threads immediately, without waiting for all models to be frustum-tested. Your ‘barrier’ causes 7 out of 8 threads/cores to idle prior to the last thread finishing. The whole needsUpdate field can be removed, as the state of a model is implicit in whether it was pushed into the queue.

I’m not so sure. Each model may be visible from lots of different perspectives at the same time, but I only want to compute the skeleton for it once. If the threads write to a shared queue they’d need some kind of mechanism to ensure that the model is only added once. That kind of synchronization will result in worse performance, I think.

I’m not quite sure what you mean by saying that 7/8 threads will idle before it finishes. Since the perspectives have very different shapes and sizes, the amount of computation they need varies a lot. My first implementation simply divided up the list of perspectives into equal shares for each core, but that resulted in widely different loads on the cores. Instead, I’m now using an atomic integer which allows the threads to grab the next perspective that needs culling, to achieve some basic load balancing. With 8 threads on an i7 (quad-core + Hyperthreading) I’ve gotten around 4.1x scaling when benchmarking these 8 tasks only.
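
For reference, that grab-the-next-item pattern is tiny (a sketch; Perspective and frustumCull() are stand-ins for whatever the real code does):

import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

class CullWorker {
    static class Perspective { /* camera or shadow map */ }

    // Each worker grabs the next unclaimed perspective; fast workers naturally
    // take more items, which balances uneven per-perspective cost.
    void cullLoop(List<Perspective> perspectives, AtomicInteger next) {
        int i;
        while ((i = next.getAndIncrement()) < perspectives.size()) {
            frustumCull(perspectives.get(i));
        }
    }

    void frustumCull(Perspective p) { /* test models against p */ }
}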

EDIT: Oh, I think I misunderstood you slightly. You mean that I’d want to run both the culling AND the skeleton computation tasks at the same time, right? I think my point still stands though. The synchronization needed to check whether a model is already being updated, or has already been updated, would scale worse than what I have now.
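
To be concrete, the kind of check I mean would be something like this (a sketch; the queued flag is hypothetical and would need resetting each frame):

import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicBoolean;

class CullableModel {
    static final Queue<CullableModel> needsSkeletonUpdate = new ConcurrentLinkedQueue<>();

    final AtomicBoolean queued = new AtomicBoolean(); // reset to false each frame

    // Called by any culling thread that finds this model visible.
    void markVisible() {
        if (queued.compareAndSet(false, true)) { // exactly one caller wins,
            needsSkeletonUpdate.add(this);       // so the model is added once
        }
    }
}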

Can’t you reduce those dependencies?

Here is an example of what the code could look like.

import java.util.ArrayList;
import java.util.concurrent.*;

final ExecutorService threadPool = new ForkJoinPool();

ArrayList<Model> models = new ArrayList<>();
final ArrayList<Perspective> perspectives = new ArrayList<>();

ArrayList<Future<Model>> futures = new ArrayList<>();

for(Model _model: models) {
    final Model model = _model; // effectively-final copy for the anonymous class
    futures.add(threadPool.submit(new Callable<Model>() { // keep the Future
        public Model call() {
            model.update();
            boolean visible = false;
            for(Perspective p: perspectives) {
                if(model.isVisible(p)) {
                    visible = true;
                    break;
                }
            }
            if(visible) {
                model.computeBones();
                model.generateMatrices();
                return model;
            }
            return null;
        }
    }));
}

ArrayList<Model> visibleModels = new ArrayList<>();
for(Future<Model> f: futures) {
    try {
        Model m = f.get(); // blocks until that model's task has finished
        if(m != null) visibleModels.add(m);
    } catch(InterruptedException | ExecutionException e) {
        throw new RuntimeException(e);
    }
}

To make sure values are written to memory and not just kept in registers, all methods of Model could be declared as synchronized (although the submit()/get() pair already establishes the necessary happens-before edges for anything written inside call()).

@DrZoidberg: the issue is that frustum culling is only efficient when looking at it from the Perspective point of view (no pun intended). You don’t ask each model whether it is in view for a certain perspective, but look at the scene from each Perspective and gather all the models it intersects with.

Riven is correct. By using, for example, a tree you can reduce the number of checks per perspective drastically, by being able to cull large amounts of objects per test. Your code requires (perspectives * models) tests.
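
A sketch of that idea (BoundingBox, Frustum and Model here are stand-ins):

import java.util.ArrayList;
import java.util.List;

class CullTree {
    interface BoundingBox {}
    interface Frustum { boolean intersects(BoundingBox b); }
    static class Model {}

    static class Node {
        BoundingBox bounds;      // encloses everything below this node
        List<Node> children = new ArrayList<>();
        Model model;             // non-null only at leaves

        // Rejecting one inner node culls its whole subtree in a single test,
        // instead of testing every model against every perspective.
        void collectVisible(Frustum f, List<Model> out) {
            if (!f.intersects(bounds)) return;
            if (model != null) {
                out.add(model);
            } else {
                for (Node c : children) c.collectVisible(f, out);
            }
        }
    }
}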

EDIT: This is also interesting, but please get back on topic… ^^