Concurrency Library

DrakosKai · March 30, 2015, 8:04pm

Hello, I’ve been reading posts on this site for a few years but I’ve never posted. I’ve been working with a library that I think Java game designers may find interesting:

https://lmax-exchange.github.io/disruptor/

It’s a very low latency Concurrency library.

Edit:
I suppose I should post a link about why I think this is relevant. An example of a game that uses a similar multi-threading model is Doom 3 BFG edition. It’s all nicely explained here in this blog:

http://fabiensanglard.net/doom3_bfg/threading.php

and here is the source:

Hopefully this will be useful to someone.

theagentd · March 30, 2015, 8:26pm

Can you explain what a “disruptor” is? Is it a replacement for ArrayBlockingQueue that’s faster?

basil · March 30, 2015, 8:28pm

… or in general compared/replacement to java.util.concurrent ?

DrakosKai · March 30, 2015, 8:32pm

It’s difficult to explain it all without repeating what the creators of it have stated publicly on the links that I posted so I will paste in the abstract from their white paper:

“LMAX was established to create a very high performance financial exchange. As part of our work to accomplish this goal
we have evaluated several approaches to the design of such a system, but as we began to measure these we ran into
some fundamental limits with conventional approaches.
Many applications depend on queues to exchange data between processing stages. Our performance testing showed
that the latency costs, when using queues in this way, were in the same order of magnitude as the cost of IO operations to
disk (RAID or SSD based disk system) – dramatically slow. If there are multiple queues in an end-to-end operation, this
will add hundreds of microseconds to the overall latency. There is clearly room for optimisation.
Further investigation and a focus on the computer science made us realise that the conflation of concerns inherent in
conventional approaches, (e.g. queues and processing nodes) leads to contention in multi-threaded implementations,
suggesting that there may be a better approach.
Thinking about how modern CPUs work, something we like to call “mechanical sympathy”, using good design practices
with a strong focus on teasing apart the concerns, we came up with a data structure and a pattern of use that we have
called the Disruptor.
Testing has shown that the mean latency using the Disruptor for a three-stage pipeline is 3 orders of magnitude lower than
an equivalent queue-based approach. In addition, the Disruptor handles approximately 8 times more throughput for the
same configuration.
These performance improvements represent a step change in the thinking around concurrent programming. This new
pattern is an ideal foundation for any asynchronous event processing architecture where high-throughput and low-latency
is required.
At LMAX we have built an order matching engine, real-time risk management, and a highly available in-memory
transaction processing system all on this pattern to great success. Each of these systems has set new performance
standards that, as far as we can tell, are unsurpassed.
However this is not a specialist solution that is only of relevance in the Finance industry. The Disruptor is a generalpurpose
mechanism that solves a complex problem in concurrent programming in a way that maximizes performance, and
that is simple to implement. Although some of the concepts may seem unusual it has been our experience that systems
built to this pattern are significantly simpler to implement than comparable mechanisms.
The Disruptor has significantly less write contention, a lower concurrency overhead and is more cache friendly than
comparable approaches, all of which results in greater throughput with less jitter at lower latency. On processors at
moderate clock rates we have seen over 25 million messages per second and latencies lower than 50 nanoseconds. This
performance is a significant improvement compared to any other implementation that we have seen. This is very close to
the theoretical limit of a modern processor to exchange data between cores.”

theagentd · March 30, 2015, 11:31pm

If you want to attract developers that aren’t the main target of the library to try it out, you should summarize what it can do and what it replaces, show how easy it is to implement and what the gains can be. The text you copied is too long, too “marketing-ish” and full of implementation details that don’t really say anything.

It’s a data structure that allows producer threads to pass data to consumer threads.
It kind of fills the same role as ArrayBlockingQueue but is more of a whole package. It’s basically a system that distributes tasks to a java.util.concurrent.Executor.
It preallocates all data on creation and won’t produce garbage as it runs (ArrayBlockingQueue’s locks produce garbage, which is the biggest source of garbage in my graphics library).
Supposedly it has much better and more consistent latency and significantly higher throughput as well.
It seems to be reasonably simple to use. Here’s an Hello World level example:


public class Simple {
    @SuppressWarnings("unchecked")
    public static void main(String[] args) {

        ExecutorService exec = Executors.newCachedThreadPool();

        // Preallocate RingBuffer with 1024 ValueEvents
        Disruptor<ValueEvent> disruptor = new Disruptor<ValueEvent>(ValueEvent.EVENT_FACTORY, 1024, exec);

        final EventHandler<ValueEvent> handler = new EventHandler<ValueEvent>() {
            // event will eventually be recycled by the Disruptor after it wraps
            public void onEvent(final ValueEvent event, final long sequence, final boolean endOfBatch) throws Exception {
                System.out.println("Sequence: " + sequence);
                System.out.println("ValueEvent: " + event.getValue());
            }
        };
        // Build dependency graph
        disruptor.handleEventsWith(handler);

        RingBuffer<ValueEvent> ringBuffer = disruptor.start();

        for (long i = 10; i < 2000; i++) {
            String uuid = UUID.randomUUID().toString();
            // Two phase commit. Grab one of the 1024 slots
            long seq = ringBuffer.next();
            ValueEvent valueEvent = ringBuffer.get(seq);
            valueEvent.setValue(uuid);
            ringBuffer.publish(seq);
        }
        disruptor.shutdown();
        exec.shutdown();
    }
}

theagentd · March 31, 2015, 8:57am

TL;DR: Verdict: would not use for games.

I couldn’t sleep, so I tried to implement a lock-free ring buffer similar to the one that the Disruptor uses but simpler. I realized that since the total number of tasks is known when a task tree is started, there is no need to worry about the ring buffer becoming full if it is simply big enough, making it less complex to implement. I use 3 AtomicLongs to write into the ring buffer, but the “problem” lies in blocking when waiting for a new value to be added to the ring buffer. This essentially does require blocking, and due to how the blocking Disruptor uses works it actually ends up being slower than ArrayBlockingQueue. Disruptor can also block by spinning or yield()-looping, but both spinning and yield()ing cause a lot of stuttering in my game as threads compete with the driver for CPU time. However, in a synthetic test measuring the overhead of my threading library performance skyrockets when using yield()-looping.

Single thread and simple queue: 880 000 task trees / sec
ArrayBlockingQueue: 66 000 task trees / sec
Ring buffer blocking: 19 800 task trees / sec
Ring buffer yield() loop: 210 000 task trees / sec

It seems like the blocking ring buffer has the potential to be faster in some cases since if the almost never is queue isn’t empty a lock() is avoided. In my actual game, the blocking ring buffer does seem to be slightly faster, but this seems to be a rather small and inconsistent difference, possibly just random VM optimization differences from run to run. In practice, Disruptor seems to be useful when a massive number of messages are passed between threads, but if most threads just block or spin performance will tank. This is rarely the case for games, so I will probably not use this for games for now.

EDIT: Also, since the blocking version uses the same locks as ArrayBlockingQueue it doesn’t improve on the garbage generated unless a yield()ing or spinning wait strategy is used.

nsigma · March 31, 2015, 9:44am

Could you elaborate on this, seen as I haven’t profiled either - what garbage is generated? And why?

To the OP - Disruptor has been mentioned a lot on this forum, though almost always by @Spasi

theagentd · March 31, 2015, 10:00am

Roquen · March 31, 2015, 10:19am

Disrupter is cool, but if you’re game needs it: you need to rethink your threads and how the communicate.

Spasi · March 31, 2015, 11:33am

[quote=“theagentd,post:6,topic:53977”]
The Disruptor has not been designed to distribute tasks to a thread pool (i.e. what ForkJoinPool does with work-stealing). That is a multi-producer-multi-consumer (MPMC) problem that’s going to suffer from contention and/or other overheads (e.g. garbage), no matter what you use. Just look at your single-threaded example, it’s 4 times faster than the best alternative, in which case the question is, why are you even trying to use concurrency?

In order to get the most out of the Disruptor, you need to have dedicated threads doing work in stages (see SEDA). This includes many SPMC and MPSC scenarios and is perfect for controlling access (back pressure + batching) to contended resources (typically IO). ArrayBlockingQueue is a general-purpose bounded queue optimized for the worst-case (MPMC). The Disruptor, when properly configured for a specific workload (SPMC or MPSC or even SPSC), destroys ABQ in both throughput and latency. You could use it for MPMC too, but you’re going to see much bigger gains with a different design, not a different queue.

The Disruptor uses a lot of tricks to reduce latency (PoT buffer size, preallocated payloads, padding objects to avoid cache-line false-sharing, etc) but I’d say its best feature is batching. Queues in a running system are usually either mostly empty (slow producer) or mostly full (slow consumer). Besides the fact that JDK queues have trouble with tail/head false-sharing, the Disruptor enables batch production or consumption of messages, which means you pay the sync overhead once per batch, instead of once per message. This does wonders for latency and CPU utilization.

[quote=“Roquen,post:9,topic:53977”]
I think specialized solutions like the Disruptor will become very useful with the Vulkan/Mantle/D3D12-style of graphics/compute programming.

Roquen · March 31, 2015, 12:38pm

I can’t think of a reason to be other than SPSC.

DrakosKai · March 31, 2015, 1:06pm

One of the reasons I posted the other links was because id Software uses a similar model + a parallel work stealing model. If you read id’s guidance on their threading they clearly state the need to keep tasks within a min/max of clock cycles. Personally I’ve managed to refactor it and tailor it to my needs on a few occasions. Typically I’d use it for access to high demand data structures.

btw I’m not trying to attract you to it. I’m not a salesman, I don’t need anyone to buy into it. I’m sure there will be some people who took a look at this thread and benefitted from it.

Edit: yuck, I got pulled into a online meeting and accidentally hit post before I finished proofreading my post, I had some Copy/Pasta going on ;D

theagentd · March 31, 2015, 1:40pm

This is a torture test of my threading system measuring the overhead of it, e.g. the task distribution and dependency system. I don’t run 1 000 000 task trees per second, I run 1000 at most, hence the gains to be made are small in the first place. If I add a realistic amount of work to each task, that workload quickly becomes the main bottleneck.

The main attraction for me was that it would eliminate the last source of garbage of my engine. If that means I’ll be forced to do spinning/yield()-looping it’s without a doubt not worth it, as that sucks performance from the texture streaming thread, the driver threads, etc. With Vulkan (assuming the driver teams do their jobs right) we should get minimal/no driver threads, so it might be more relevant at that time. Still, if you’re exchanging that many tiny messages between threads in the first place, I suspect you’re doing something wrong.

Riven · March 31, 2015, 5:28pm

Disruptor was developed for a system dealing with financial transactions, where messages are tiny and abundant, and microsecond latency can cost a fortune. Their core logic is single threaded on 1 node. Their bottlenecks are not usually found in game-engines, where typical tasks take a lot longer than mere micro-seconds.

Sidenote: Why isn’t the false sharing issue fixed in ABQ? It seems trivial to add object field padding to force seperate cache lines for these fields.

DrakosKai · March 31, 2015, 8:28pm

I probably shouldn’t respond any further on this thread. But I feel like most people are focusing on each line of my post independently of the others and thus taking my point out of context. So I will state and then be done.

The model of the disruptor is a very good example of how to implement a non-locking threading algorithm in Java. Personally I made a completely different version of it that also combines in with a work stealing queue. Well, by “made” I mean “is mostly done and currently testing and tweaking”.

The reason I have faith in this pattern is because it is used in successful game engines already(Thanks id Software for sharing). id uses a similar pattern to the disruptor–

http://fabiensanglard.net/doom3_bfg/threading.php

–however, this is in C++ so that’s where the source to the disruptor comes in. Personally, I think if taken in context, these two pieces of information are gold as far as Java gaming is concerned.

princec · March 31, 2015, 10:41pm

hmm… apparently not so trivial after a little investigation. Or at least, not widely known about.

Cas

theagentd · March 31, 2015, 11:23pm

Doom 3 hardly seems like a good example. Barely scaling to a fixed couple of cores is not what I call “heavily multi-threaded”.

Roquen · April 1, 2015, 7:19am

http://fd.fabiensanglard.net/doom3_bfg/worker_atomic_sync.png

AtomicInteger = bus locking. Not a big deal but potentially stealing cycles from everybody.
Linear Ring buffer = false sharing extravaganza.
jobs are processed FIFO…meh.
small jobs = very inefficient, esp if they don’t happen to be the same code/linear data and happen to go to the same worker.
large jobs = potentially long latency…job submission to completion.

I can’t think of a game client reason to use this model. And like I said earlier I can’t think of a reason to be anything other than SPSC. Strawman example: Take an audio processing thread. Of course it’s completely reasonable to have multiple threads wanting to send commands to it, but there’s no reason for it to be through a single communication channel.

I’m only thinking about consumer grade hardware. Otherwise we’re talking about MMO server code…and anyone ready for that doesn’t need this conversation. Consider:

Is there any major paradigm shift in CPU ISAs on the horizion? No.
How fast is the number of physical cores growing? It’s complete stagnated.
How likely are we to see multi-socket consumer devices soon. Not very.

And even if I’m wrong on any of these points it doesn’t really matter because it’ll be years before penetration is large enough to be worth considering. So SPSC is completely viable for the foreseeable future.

Roquen · April 1, 2015, 8:04am

Oh, I’m excluding consoles from my thinking here…that seems reasonable.