NIO performance benefits

My employer has asked me to prepare a presentation on NIO based on my previous experience with the APIs. My experience is limited to using buffers to pass data to native code and exposing chunks of native memory via direct ByteBuffers. I’ve been experimenting with other parts of the APIs, trying to find some noticeable performance benefits in the context of file processing. Unfortunately I haven’t been able to find any so far. I was hoping someone could answer the following questions so I can get a better understanding of what kinds of performance improvements I should expect.

Has anyone noticed big performance improvements going from stream-based I/O to channel-based I/O? By this I don’t mean blocking, one-thread-per-stream versus non-blocking, multiplexed I/O; just InputStream versus Channel.
Does read(ByteBuffer) have any immediate benefits compared to read(byte[])? Related to that question: is writing to a direct ByteBuffer from native code always much faster than writing to an array using GetPrimitiveArrayCritical?
I read on some guy’s blog that GetPrimitiveArrayCritical basically halts all running threads. Was this actually the case, and is it still the case in HotSpot?

To be honest, I’ve never seen noticeable performance improvements with NIO; however, that’s not because there aren’t any differences. In heavy-duty environments the difference is very apparent.

Imagine these two cases:
A: socket.getOutputStream().write(veryLongByteArray);
and
B: SocketChannel.open().write(equallyLongDirectByteBuffer);

In A, every single byte is pushed through the CPU cache, swamping it and pushing everything else out.
In B, it’s all DMA, so you get to keep your valuable data in the CPU cache.

If all you do is I/O, you won’t really notice the difference, but if you’re doing CPU-intensive work as well, use NIO with direct buffers.
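To make the comparison concrete, here is a minimal sketch of the two variants (the localhost port is hypothetical and there is no error handling); the stream variant copies a heap array through the JVM, while the channel variant hands the kernel a direct buffer:

```java
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.nio.ByteBuffer;
import java.nio.channels.SocketChannel;

public class StreamVsChannelWrite {
    public static void main(String[] args) throws Exception {
        byte[] payload = new byte[4 * 1024 * 1024];

        // A: classic stream write from a heap array
        Socket socket = new Socket("localhost", 9999);
        OutputStream out = socket.getOutputStream();
        out.write(payload);
        socket.close();

        // B: channel write from a direct buffer allocated outside the Java heap
        ByteBuffer direct = ByteBuffer.allocateDirect(payload.length);
        direct.put(payload);
        direct.flip();

        SocketChannel channel = SocketChannel.open(new InetSocketAddress("localhost", 9999));
        while (direct.hasRemaining()) {
            channel.write(direct);   // write() may send only part of the buffer, so loop
        }
        channel.close();
    }
}
```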

In the context of file processing, Sun’s NIO implementation was riddled with shocking bugs well into 1.5.x, IIRC. I very strongly advise you to search the Bug Parade for all outstanding bugs against java.nio and java.io.File on the version(s) of the JRE you’ll be using.

Java 6 is aiming to finally fix file processing in Java. Some core fixes for bugs originally filed against Java 1.1.x are promised for, or already fixed in, that release; for instance basic functionality like finding out the remaining free space on a drive before doing a file operation that is going to write large files.
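If I recall correctly, that free-space query lands as a plain java.io.File method in Java 6 (getUsableSpace/getFreeSpace); a tiny sketch, with a hypothetical target directory:

```java
import java.io.File;

public class FreeSpaceCheck {
    public static void main(String[] args) {
        File target = new File("/data/output");      // hypothetical directory to write into
        long required = 2L * 1024 * 1024 * 1024;     // say we are about to write a 2 GB file

        // getUsableSpace() reports the bytes this JVM can actually write to (Java 6+)
        if (target.getUsableSpace() < required) {
            System.err.println("Not enough free space on " + target);
        }
    }
}
```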

…unless I’m misremembering and they’re not due until Java 7, in which case I shall probably abandon all hope of Sun ever working out how to make Java survive against .NET, and make the switch now :(.

That makes sense and is also the idea I had myself. I can understand the difference this makes if, for instance, you’re sending a file over a socket: no processing is done on the data, so you don’t want to involve the CPU at all, and the NIO variant lets you eliminate the copies from native memory to the Java heap and back. If, however, you actually need the data that was read into a buffer, would there still be a performance benefit? The data has to be pulled into the JVM in that case. I would expect to see a difference only if you can get HotSpot to inline all the calls to ByteBuffer#get. When reading multi-byte values it might also be faster to read them using a view buffer instead of manually assembling the bytes. Any ideas on this?
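For the send-a-file-over-a-socket case, the channel API can skip the Java-side copy entirely via FileChannel.transferTo; a rough sketch, assuming an already-connected SocketChannel and a hypothetical path:

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.SocketChannel;

public class FileSender {
    // Streams the whole file into the socket without pulling the bytes onto the Java heap.
    static void send(String path, SocketChannel socket) throws IOException {
        FileChannel file = new FileInputStream(path).getChannel();
        try {
            long position = 0;
            long size = file.size();
            while (position < size) {
                // transferTo() may transfer fewer bytes than requested, so loop until done
                position += file.transferTo(position, size - position, socket);
            }
        } finally {
            file.close();
        }
    }
}
```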

The other big advantage of NIO is that it allows you to use one thread for multiple sockets.

In apps that have to scale to a large number of connections, that can start making a serious performance difference.

Jeff and I were typing at the same time, it seems…

Once you start reading the bytes in the application, the CPU cache will get swamped, whether the JVM inlines that call or not.

The other performance-related part of NIO is that it can be set up to be non-blocking. With the Selector you can basically do all your I/O in one thread (or a few, if you know what you’re doing); this removes all the context switching and a lot of memory consumption. A thread stack typically takes around 1 MB, so big thread pools (e.g. in web servers) can waste a large amount of memory.
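A bare-bones sketch of that single-threaded Selector style (hypothetical port, echo semantics, and no handling of partial writes):

```java
import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;
import java.util.Iterator;

public class SelectorEchoServer {
    public static void main(String[] args) throws Exception {
        Selector selector = Selector.open();
        ServerSocketChannel server = ServerSocketChannel.open();
        server.socket().bind(new InetSocketAddress(9999));   // hypothetical port
        server.configureBlocking(false);
        server.register(selector, SelectionKey.OP_ACCEPT);

        ByteBuffer buffer = ByteBuffer.allocateDirect(8192);
        while (true) {
            selector.select();   // blocks until at least one channel is ready
            Iterator<SelectionKey> it = selector.selectedKeys().iterator();
            while (it.hasNext()) {
                SelectionKey key = it.next();
                it.remove();
                if (key.isAcceptable()) {
                    SocketChannel client = server.accept();
                    client.configureBlocking(false);
                    client.register(selector, SelectionKey.OP_READ);
                } else if (key.isReadable()) {
                    SocketChannel client = (SocketChannel) key.channel();
                    buffer.clear();
                    if (client.read(buffer) == -1) {
                        client.close();              // peer went away
                    } else {
                        buffer.flip();
                        client.write(buffer);        // echo back whatever arrived
                    }
                }
            }
        }
    }
}
```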

Don’t forget also that every access to an array causes a bounds check. In ByteBuffers, that check is eliminated.

That’s simply not true.

Otherwise we could do wicked pointer-arithmetic with bb.get(-100) and bb.get(100000000).

When the JVM decides it’s safe to remove the check at certain points, it will also do it for arrays in that case.

Perhaps unrelated, but I was fiddling around with NIO and all. When profiling my app, ByteBuffer.get() took the most time. OK, that isn’t weird since it was called awfully often, but there didn’t seem to be a way to avoid this. Was I being stupid?

Switch to bb.get(int) and run your code again. Much faster.

While it’s true that the best performance is achieved when the implementation of (type)Buffer.get(int) is inlined, in general the performance of NIO Buffers seems to be pretty good for the applications (mostly 3D) where I’ve seen them used and used them myself. A lot of work went into the Java HotSpot server compiler in the Java SE 5.0 update releases and Java SE 6, and is still ongoing, to further optimize loops involving NIO and other constructs.

Yes, using multiple views of an underlying ByteBuffer to fetch multi-byte data will be faster than fetching the same values from the ByteBuffer itself. Make sure you have the byte order set to the “native” order (assuming your application is architected in such a way that this is possible; when you are only using Buffers on one machine this is always safe to do, but when network traffic is involved you of course have to be careful). If you look at the source code for the GlueGen project on java.net, you’ll find a StructAccessor class which uses multiple views of a ByteBuffer to fetch data out of C structs efficiently.
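A small sketch of the view-buffer approach (the file name is hypothetical, and it assumes the data was written in the machine’s native byte order):

```java
import java.io.FileInputStream;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.DoubleBuffer;
import java.nio.channels.FileChannel;

public class ViewBufferRead {
    public static void main(String[] args) throws Exception {
        FileChannel channel = new FileInputStream("doubles.bin").getChannel();

        ByteBuffer bytes = ByteBuffer.allocateDirect((int) channel.size());
        bytes.order(ByteOrder.nativeOrder());   // only correct if the file was written in native order
        while (bytes.hasRemaining() && channel.read(bytes) != -1) {
            // keep reading until the whole file is buffered
        }
        bytes.flip();

        // A view over the same memory: no copy, and each get() fetches 8 bytes at once
        DoubleBuffer doubles = bytes.asDoubleBuffer();
        double sum = 0;
        for (int i = 0; i < doubles.limit(); i++) {
            sum += doubles.get(i);
        }
        System.out.println("sum = " + sum);
        channel.close();
    }
}
```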

Thanks for all the replies so far. I would now like to write a demo application that illustrates the performance benefit you can get from not thrashing the CPU cache. I came up with some code through the following reasoning:
The first part of the test is a thread that simply calculates stuff using a dataset that fits entirely in the CPU’s cache. I’m working on a Pentium 4, which has 512 KB of cache, so I took a dataset size of 256 KB.
To overwrite the entire cache I do an I/O operation with a chunk of data that is at least as large as the cache. I took a byte[] of 4 MB for this. For now the I/O consists of writing data to a file.
This gave me the following average timings for my calculation loop:

  • No I/O: 2406ms
  • Stream I/O: 2594ms
  • NIO: 2500ms

So there is a difference, but it’s not as spectacular as I had hoped for :) Any idea how I could improve on this setup to make the results clearer to my audience? The current version performs disk I/O, and my hunch is that this doesn’t push data through the CPU fast enough to show a big difference. I’m planning on changing the test to write to a socket; maybe that will work better.

If anyone’s interested, I’ve attached the source code.
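For anyone who doesn’t want to open the attachment, the benchmark described above has roughly this shape (this is a sketch rather than the actual attached code; the output file name and iteration count are made up):

```java
import java.io.FileOutputStream;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;

public class CacheThrashDemo {
    static final int WORKING_SET = 256 * 1024;      // fits in a 512 KB L2 cache
    static final int IO_CHUNK = 4 * 1024 * 1024;    // big enough to evict it

    public static void main(String[] args) throws Exception {
        final int[] data = new int[WORKING_SET / 4];

        // CPU-bound thread: repeatedly walks the small working set and reports its time
        Thread calc = new Thread(new Runnable() {
            public void run() {
                long start = System.currentTimeMillis();
                long sum = 0;
                for (int pass = 0; pass < 10000; pass++) {
                    for (int i = 0; i < data.length; i++) {
                        sum += data[i];
                    }
                }
                System.out.println("calc loop: " + (System.currentTimeMillis() - start) + " ms (" + sum + ")");
            }
        });
        calc.start();

        // I/O load: write a large direct buffer to a file while the calculation runs;
        // swap in a byte[] and FileOutputStream.write() to test the stream variant
        ByteBuffer chunk = ByteBuffer.allocateDirect(IO_CHUNK);
        FileChannel out = new FileOutputStream("dump.bin").getChannel();
        while (calc.isAlive()) {
            chunk.clear();
            out.write(chunk);
        }
        out.close();
    }
}
```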

Write the results to another Buffer or byte[] to remove any such bottlenecks.

You’ve got a point, though: at I/O rates of less than 100 MB/s the CPU cache swamping may not even occur, or if it does, it happens at such large intervals that you won’t notice.

Unless I’m mistaken, memory-to-memory copies do not go through a DMA controller, so I think sockets are the better option. I’ll post the results when I get that working.

Since bb.hasRemaining() was used in a couple of other places, I was unable to quickly adapt my code to test your claim. Presuming you’re right, what’s up with that? How is my incrementing int going to be that much more efficient than the ByteBuffer’s own position marker?

  • the JIT seems to handle the two calls differently
  • look at the source code of both methods: conditional branches and try/catch/throw are expensive (see the sketch below)
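The two loop shapes being compared look roughly like this; whether the absolute form actually wins depends on the JVM and on whether get(int) ends up inlined:

```java
import java.nio.ByteBuffer;

public class GetLoops {
    // Relative gets: every call checks and bumps the buffer's internal position marker
    static long sumRelative(ByteBuffer bb) {
        long sum = 0;
        bb.rewind();
        while (bb.hasRemaining()) {
            sum += bb.get();
        }
        return sum;
    }

    // Absolute gets: the counter is a local int the JIT can keep in a register
    static long sumAbsolute(ByteBuffer bb) {
        long sum = 0;
        int limit = bb.limit();
        for (int i = 0; i < limit; i++) {
            sum += bb.get(i);
        }
        return sum;
    }
}
```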

As promised, here are my conclusions/results so far:

  • No consistent performance difference between streams and channels (neither for files nor sockets)
  • A 50% increase in performance when reading doubles from a direct ByteBuffer with native byte order, compared to DataInputStream
  • A 50% decrease in performance when reading doubles from a direct ByteBuffer with non-native byte order, compared to DataInputStream

The first point is not really what I had hoped for, but it’s not entirely unexpected. Both code paths do pretty much the same thing; the only major difference is the additional Get/SetByteArrayRegion calls, and in my benchmark these didn’t seem to make much of a difference. The implementation of these methods is essentially a memcpy, which is pretty fast, but I suspect they also have to interact with the GC somewhere. If that’s the case, then the lack of a performance difference could be because I wasn’t doing any object allocation. I’ll test that later to see if it makes a difference.
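For the second and third points, the two read paths being compared were roughly the following (a sketch, not the actual benchmark code; the file name is made up, and the order() call is the knob that separates the native from the non-native case):

```java
import java.io.BufferedInputStream;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.FileInputStream;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.channels.FileChannel;

public class DoubleReadPaths {
    static final String FILE = "doubles.bin";   // hypothetical file full of doubles

    // Baseline: DataInputStream assembles each double from 8 individual (big-endian) bytes
    static double sumStream() throws Exception {
        DataInputStream in = new DataInputStream(
                new BufferedInputStream(new FileInputStream(FILE)));
        double sum = 0;
        try {
            while (true) {
                sum += in.readDouble();
            }
        } catch (EOFException end) {
            in.close();
        }
        return sum;
    }

    // Direct buffer: getDouble() reads 8 bytes at a time; swap nativeOrder() for the
    // opposite order to see the slower, byte-swapping variant
    static double sumChannel() throws Exception {
        FileChannel channel = new FileInputStream(FILE).getChannel();
        ByteBuffer bb = ByteBuffer.allocateDirect((int) channel.size());
        bb.order(ByteOrder.nativeOrder());
        while (bb.hasRemaining() && channel.read(bb) != -1) {
            // fill the buffer with the whole file
        }
        bb.flip();
        double sum = 0;
        while (bb.remaining() >= 8) {
            sum += bb.getDouble();
        }
        channel.close();
        return sum;
    }
}
```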