Should you risk using NIO for hard-core networking

I was recently promoting NIO to someone who didn't think it was worth it, and got into a little difficulty. I left the conversation with a few doubts, mainly because of the following points he made:

  • Why isn’t NIO properly documented yet? Lots of critically important parts of the API (e.g. how do you detect a disconnect at each and every stage?) are not covered in the API docs… you have to read FAQs and tutorials to find them out instead. (Note: this was true of the 1.4.0 release, but perhaps 1.4.1 or .2 has corrected this?)

  • Can we trust something that contained several show-stoppers in the first post-beta release? Is this actually being tested properly by Sun? Several of the bugs are basic problems that should have been uncovered by unit testing, which sets a worrying precedent that they weren’t found. (I’m not intimately familiar with the bugs, but I remember spotting some stuff that was fixed in 1.4.1 that surprised me; IIRC there was one bug where the NIO API wasn’t actually implemented properly on Windows: it used completion ports, and was constrained by some fairly low limits inherited from that. Also various problems of bits of network NIO just not working at all on some platforms, IIRC?)

  • How should one use network NIO for high-performance servers? There seem to be few patterns and documents anywhere on the web (I found one eventually, although it’s part of a research project at Stanford, and somewhat off the beaten track) describing how to use NIO in a typical tens-of-thousands-of-clients-per-server system, i.e. with at least 20 threads and all the associated problems. The APIs do some pretty mean and nasty things when you start multi-threading (lots of obvious ways of using the API break because of e.g. places where threads are mutex’ed in surprising ways). I have in the past had problems with the fact that SelectableChannel.register() blocks if the Selector you’re registering with is actually in use; this makes it impossible to use register() in the obvious way in an MT environment. Instead you have to write rather atrocious logic (from an OO point of view) in your select() loop which, immediately after a select() unblocks, registers a queue of awaiting requests (see the sketch after this list). This is a stupid API design, and makes every user jump through the same hoop in every app that uses MT’d NIO networking.

  • How do you have 40 threads sharing the load of answering requests, working off five Selectors, and using several pre-allocated, native-IO-subsystem ByteBuffers? (My suggested answer to this is that if you create a mem-mapped/direct buffer and then make views on it, one for each thread, hopefully this gives the same effective performance as if you had just one thread and one mem-mapped buffer with no views. The API is not entirely clear on this, saying only that mem-mapped buffers “behave no differently to direct” BBs, and that views of direct BBs are also direct; it doesn’t make explicit what the memory requirements are for making views on direct BBs, saying only that you, as the programmer, won’t be able to measure it because they are outside the Java IO system. Who really knows what the overhead is per view?)
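To make the register() workaround mentioned in the third point concrete, here is a minimal sketch of the pending-registration-queue idea; the class and method names are invented for illustration and error handling is skipped:

import java.io.IOException;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.SocketChannel;
import java.util.LinkedList;
import java.util.List;

// Hypothetical sketch: other threads must not call register() directly while
// the selector thread may be blocked in select(), so they queue requests instead.
class SelectorLoop implements Runnable {
    private final Selector selector;
    private final List pendingRegistrations = new LinkedList(); // guarded by itself

    SelectorLoop(Selector selector) { this.selector = selector; }

    // Called from any thread: queue the channel and wake the selector up.
    public void requestRegistration(SocketChannel channel) {
        synchronized (pendingRegistrations) {
            pendingRegistrations.add(channel);
        }
        selector.wakeup(); // unblocks select() so the registration happens promptly
    }

    public void run() {
        try {
            while (true) {
                // Drain the queue immediately after select() unblocks,
                // while no thread is sitting inside select().
                synchronized (pendingRegistrations) {
                    while (!pendingRegistrations.isEmpty()) {
                        SocketChannel ch = (SocketChannel) pendingRegistrations.remove(0);
                        ch.configureBlocking(false);
                        ch.register(selector, SelectionKey.OP_READ);
                    }
                }
                selector.select();
                // ... process selector.selectedKeys() here ...
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}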

…these were his major points, and I could only suggest ways in which they MIGHT be solved, but we couldn’t find details in the API docs to answer all of these questions. Does anyone know the answers to the above questions? Do you think it’s worth worrying about these problems (they mostly seem to be issues of poor documentation, which you eventually learn yourself the hard way anyway, but the possible lack of testing by Sun had me worried)?

Hard to tell, the ‘N’ stands for ‘new’ anyway.

My stuff works with JRE 1.4.1_01, but not with 1.4.1, due to a subtle bug in ByteBuffer#slice().
If I had developed against 1.4.1, I’m not sure whether I would ever have uncovered that fact, or whether I would just have given up.

After I figured out some things with my own experiments (e.g. what happens when a connection closes), I have to say NIO works like a charm for me. But I admit that it wasn’t easy to get there.

Examples/tutorials exist, but they all cover the same simple situation. I wish there were more.

Currently I’m not sure whether I’m doing everything the way it was meant to be done. So I still feel a bit insecure.

But is there a choice? From an architectural point of view there is no other way for many-connection apps that can’t afford that many threads (be careful with slice() in an MT environment, BTW).

Whether you dare to suggest it depends on the business situation, I think. For a game, there might be enough room for a little risk and checking things out?

…I know exactly what you mean :)…

Thanks for the warning about slice…what in particular goes wrong with it?

I agree that in many cases there may seem to be no other options; generally that leads to a choice between a variety of equally crap alternatives (for networked stuff, usually: use a different language, use Java with JNI (which turns into really ugly hackery in this situation), or use 3rd party software (IF you can trust it, which often you can’t)).

Nothing in particular. It just isn’t a stand-alone structure, but a view onto a buffer. So if you overwrite the buffer in one thread while another uses the slice… common MT problem.
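A tiny example of what that means, for anyone following along (values are arbitrary):

import java.nio.ByteBuffer;

// slice() does not copy: the slice is a window onto the same storage, so a
// write through the original buffer is visible through the slice (and vice
// versa). With two threads involved, that's the classic shared-data problem.
class SliceDemo {
    public static void main(String[] args) {
        ByteBuffer original = ByteBuffer.allocate(8);
        ByteBuffer view = original.slice();

        original.put(0, (byte) 1);
        System.out.println(view.get(0));     // prints 1: same backing data

        view.put(0, (byte) 2);
        System.out.println(original.get(0)); // prints 2: changes flow both ways
    }
}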

Herkules,

Do you have any other examples other than what Sun has provided for NIO? I’m doing conversions to 1.4 at the moment and I’d like to use it, but Sun’s is very small and doesn’t really address what I need to understand.

Further, you mentioned in another thread that using XML isn’t a good way to send game data over the wire. I agree. I’d like to know if you could describe the method you use a bit more clearly. For example, you said you’re sending a ByteBuffer. What are its contents (if that’s OK for you to describe)?

I’m looking for a simple example. Right now, my networking sends “sentences” comprised of a Command ID, subject, target, and additional params, like:

05:8A8B:409F:48:45:4B:4B:4F

Which is something that translates to: private message from person ‘A’ to person ‘B’: hello
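For illustration, one hypothetical way a sentence like that could be packed into a ByteBuffer (field widths, names and values here are invented purely for the example):

import java.nio.ByteBuffer;

// Hypothetical framing: 1-byte command, 2-byte subject, 2-byte target,
// then the remaining params as raw bytes. Widths are invented for illustration.
class SentenceEncoder {
    static ByteBuffer encode(byte command, short subject, short target, byte[] params) {
        ByteBuffer buf = ByteBuffer.allocate(1 + 2 + 2 + params.length);
        buf.put(command);       // e.g. 0x05 = private message
        buf.putShort(subject);  // e.g. person 'A'
        buf.putShort(target);   // e.g. person 'B'
        buf.put(params);        // e.g. the message text bytes
        buf.flip();             // ready for channel.write(buf)
        return buf;
    }
}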

Any thoughts?

I’m doing two NIO servers at the moment for different projects; one has to handle several thousand players, the other has to handle 10,000. I’m working towards maximally efficient IO, but not worrying too much about it yet - so don’t take my examples as being necessarily good :D. One thing to watch out for is that relative efficiency in NIO is VERY sensitive to the sizes of Buffers, and also to lots of circumstantial details - there is also a lot of variability because the underlying implementation of NIO on different platforms is very different, using whatever native features are available.

For a server that has to handle thousands of clients at once, I’m using a couple of thread pools, so that I can independently scale the number of threads at different points in the request-calculation-response pipeline. It’s also a good way (at this stage) of discovering MT-problems and non-thread-safe code before we get to later development stages :).

I have a single selector for the ServerSocketChannel, served by one thread; it’s possible (I’ve done it before) to have multiple threads on this selector, but it’s tricky (I can provide a detailed example if you need), and if you are ONLY doing accepts() on the Selector, then probably unnecessary. There might be an easy way - if so, I haven’t found it yet.

That thread takes accepted connections and registers them with one of several read-selectors. The first reason for multiple read-selectors is that you can have one thread per selector, and get effective MT on reads, without having to f*** about with MT on a single selector. The second reason I have several is to ensure that the response-time per client is close to the average response-time; if you have thousands of clients in one select, the first ones will get great response time, the last ones will get poor times (this is heavily dependent on what you do next, but multiple selectors here avoid problems further down the line, even if you make mistakes later on).
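To make that concrete, here is a rough sketch of such an accept thread, reusing the hypothetical SelectorLoop/pending-registration idea sketched earlier in the thread so that register() never blocks against a selector that is mid-select (all names are invented, error handling is omitted):

import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;
import java.util.Iterator;

// Hypothetical accept loop: one selector/thread doing only accepts, handing
// accepted channels round-robin to several read-selector threads
// (the SelectorLoop objects from the earlier sketch, one thread each).
class AcceptLoop implements Runnable {
    private final Selector acceptSelector;
    private final SelectorLoop[] readLoops;
    private int next = 0;

    AcceptLoop(int port, SelectorLoop[] readLoops) throws IOException {
        this.readLoops = readLoops;
        acceptSelector = Selector.open();
        ServerSocketChannel server = ServerSocketChannel.open();
        server.configureBlocking(false);
        server.socket().bind(new InetSocketAddress(port));
        server.register(acceptSelector, SelectionKey.OP_ACCEPT);
    }

    public void run() {
        try {
            while (true) {
                acceptSelector.select();
                Iterator it = acceptSelector.selectedKeys().iterator();
                while (it.hasNext()) {
                    SelectionKey key = (SelectionKey) it.next();
                    it.remove();
                    SocketChannel client = ((ServerSocketChannel) key.channel()).accept();
                    if (client == null) continue; // nothing actually pending
                    // Hand off round-robin; the read-selector's own thread
                    // performs the actual register() when it next loops.
                    readLoops[next++ % readLoops.length].requestRegistration(client);
                }
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}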

Each read-selector has a single thread (as noted above, I wanted MT on one selector, but it just proved far too difficult). It uses a helper class to decide whether the incoming data constitutes a complete request - if not, it just attaches the incomplete request to the SelectionKey, and moves on. If it IS a complete request, it parses it into some form of RequestObject. Note: I’ve had to do servers in the past, using NIO, where the only way to tell if a request/message was complete was to parse it - this can happen with, for instance, an extensible message format. Life is MUCH easier if you choose a network protocol which has an explicit “END-OF-MESSAGE” marker, and that guarantees to escape that marker if it ever appears in the body of the message. If you don’t have this luxury, I can suggest a slightly different approach which works well, based on what I did before.
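A minimal sketch of that per-key framing check, assuming a protocol with an explicit end-of-message byte (the delimiter value, buffer size and class names are invented for the example):

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.SelectionKey;
import java.nio.channels.SocketChannel;

// Hypothetical read handler: accumulates bytes per connection in a buffer
// attached to the SelectionKey and extracts complete, delimiter-terminated
// messages. A real version must also handle messages larger than the buffer.
class FramingReader {
    private static final byte END_OF_MESSAGE = (byte) 0x00; // invented delimiter

    // Called by the read-selector thread when key.isReadable().
    void handleRead(SelectionKey key) throws IOException {
        SocketChannel channel = (SocketChannel) key.channel();
        ByteBuffer pending = (ByteBuffer) key.attachment();
        if (pending == null) {
            pending = ByteBuffer.allocate(8192);
            key.attach(pending);
        }
        int n = channel.read(pending);
        if (n == -1) { key.cancel(); channel.close(); return; } // disconnect

        pending.flip();
        int start = 0;
        for (int i = 0; i < pending.limit(); i++) {
            if (pending.get(i) == END_OF_MESSAGE) {
                byte[] msg = new byte[i - start];
                for (int j = 0; j < msg.length; j++) msg[j] = pending.get(start + j);
                dispatch(msg);         // parse into a RequestObject, queue it, etc.
                start = i + 1;         // skip the delimiter itself
            }
        }
        pending.position(start);
        pending.compact();             // keep any incomplete tail for next time
    }

    void dispatch(byte[] completeMessage) { /* hand off to the request queue */ }
}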

Parsed RequestObjects get passed to a “RequestServicer” object, via a method like “addToIncomingQueue(…)”. From here on in, you can just use standard thread-pooling techniques to achieve decent concurrency on the number of reads. Theoretically, unless your parse(…) method is particularly expensive, you also make the initial response to any particular client close to the average response time (as mentioned above).

As the queue-processing-threads act on messages and generate responses, they just attach the response back onto the SelectionKey, and mark the Key as being “interested in OP_WRITE”; the read-selector threads also take care of writing, if a key isWritable().
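For illustration, the write side of that might look roughly like this (a bare ByteBuffer is used as the attachment only to keep the sketch short; in practice the attachment would be a small per-connection object holding both the inbound buffer and the outbound responses):

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.SelectionKey;
import java.nio.channels.SocketChannel;

// Hypothetical: a worker thread attaches a response and flags interest in
// OP_WRITE; the read-selector thread drains it when the key is writable.
class ResponseWriter {

    // Called by a queue-processing (worker) thread once a response is ready.
    static void queueResponse(SelectionKey key, ByteBuffer response) {
        key.attach(response);
        key.interestOps(key.interestOps() | SelectionKey.OP_WRITE);
        key.selector().wakeup(); // make the change visible to the selecting thread promptly
    }

    // Called by the read-selector thread when key.isWritable().
    static void handleWrite(SelectionKey key) throws IOException {
        ByteBuffer response = (ByteBuffer) key.attachment();
        SocketChannel channel = (SocketChannel) key.channel();
        channel.write(response);                      // may be a partial write
        if (!response.hasRemaining()) {
            // Done: go back to caring only about reads.
            key.interestOps(key.interestOps() & ~SelectionKey.OP_WRITE);
        }
    }
}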

Let me know if this helps (I’d also be pleased to hear from anyone who can spot a gross inefficiency in this approach! :)). This is the first time I’ve tried this particular take on things - previously, I had one set of selectors dedicated to reads, and a different one dedicated to writes.

Perhaps you need something with more details? For instance, I haven’t mentioned any of the hoops you have to jump through with Buffers - but that’s somewhat dependent on your higher-level strategy. On this project, I’m using a combination of static pre-allocated direct ByteBuffers - passing a readOnly view of each (HOPING that the "direct"ness of the BB percolates through properly) - and dynamically generated BBs, combined using a gathering write. I’ve logged an RFE with Sun on the grounds that views of a BB don’t let you automatically lock/synchronize with the underlying position(), making various obvious implementations a PITA. I’m not sure what difference it makes if your gathering write combines only direct BBs, or a mix of direct and non-direct - I would EXPECT it to be slower in the latter case? But the API docs don’t appear to be explicit on this matter (I may just have missed the key sentence, though I’ve re-read them several times in the last few weeks!).

I even experimented with doing non-blocking reads, and blocking-writes - on the same Socket(Channel) (this is what one of the Sun examples did!). It works; I was very dubious that it would, and suspicious even after I’d tried it myself :). But I find it quite disturbing, and I’m steering clear of doing it again :).

I am new to this site, looking in particular for more insight on this subject. We are taking a very similar approach of having multiple selectors for reads, one per thread.

Have you done any stress testing to find an efficient number of connections to be handled per read selector?

What in particular about the Byte Buffers is so sensitive?

I have seen some rather long latencies in the testing that I have done and am in the process of determining if those latencies are on the underlying network connection or somewhere inside the NIO code.

I have noticed rather long garbage collection pauses in between the selector.select() calls when running with -verbose:gc. These may be a result of how I am managing my ByteBuffers. (??) My approach to handling incoming connections is rather simple. Each SelectionKey attachment object reads in the first 4 bytes, which give the size of the message, then allocates a ByteBuffer of that size and does non-blocking reads until the buffer is full, then flips it, reads it, and sends it off to be processed, at which point ByteBuffer.clear() is called (which doesn’t really free up the memory, just resets the indexes) and the whole process starts over again.
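In case it helps the discussion, here is roughly how I read that description, as a sketch (names are invented, error handling is omitted): a 4-byte length header followed by a body buffer allocated to that size.

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.SocketChannel;

// Hypothetical per-connection attachment implementing the scheme described
// above: read a 4-byte length, allocate a body buffer of that size, fill it
// with non-blocking reads, then hand the complete message off for processing.
class LengthPrefixedReader {
    private final ByteBuffer header = ByteBuffer.allocate(4);
    private ByteBuffer body; // allocated once the length is known

    // Returns a complete message, or null if more data is still needed.
    ByteBuffer readFrom(SocketChannel channel) throws IOException {
        if (body == null) {
            if (channel.read(header) == -1) throw new IOException("disconnected");
            if (header.hasRemaining()) return null;      // header not complete yet
            header.flip();
            body = ByteBuffer.allocate(header.getInt()); // note: fresh garbage per message
            header.clear();
        }
        if (channel.read(body) == -1) throw new IOException("disconnected");
        if (body.hasRemaining()) return null;            // body not complete yet
        ByteBuffer complete = body;
        body = null;                                     // ready for the next message
        complete.flip();
        return complete;
    }
}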

Has anyone done any load testing with simple NIO example? What kinds of results are seen?

Thanks for any insight.

[quote]I’m looking for a simple example. Right now, my networking sends “sentences” comprised of a Command ID, subject, target, and additional params, like:

05:8A8B:409F:48:45:4B:4B:4F

Which is something that translates to private message from person ‘A’ to person ‘B’ : hello

Any thoughts?
[/quote]
I’m of the opinion that unless you are trying to trim bits off your protocol, you should include a ‘length of this frame’ field so that older clients can recover from an unknown packet from a newer client. For a little more info see section 2.3. (Blah suggests something similar in paragraph 5 of his big post above.)
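A small sketch of why the length field helps, with a made-up frame layout: an older client can hop over a frame whose command it doesn’t recognise instead of losing sync with the stream.

import java.nio.ByteBuffer;

// Hypothetical frame layout: [int length][byte command][payload of 'length' bytes].
// Assumes 'data' already holds only complete frames.
class FrameSkipper {
    static void process(ByteBuffer data) {
        while (data.remaining() >= 5) {
            int length = data.getInt();           // payload length (excludes header)
            byte command = data.get();
            int nextFrame = data.position() + length;
            if (isKnownCommand(command)) {
                handle(command, data, length);    // may read up to 'length' bytes
            }
            // Known or not, jump to the next frame boundary so an unknown
            // command from a newer client can't knock us out of sync.
            data.position(nextFrame);
        }
    }
    static boolean isKnownCommand(byte c) { return c == 0x05; /* placeholder */ }
    static void handle(byte command, ByteBuffer data, int length) { /* ... */ }
}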

[quote](Blah suggests something similar in paragraph 5 of his big post above.)
[/quote]
Grin :). Very good advice for protocol design. I’d also add that you should include a “conversation ID” and a “message ID” in each message (one is constant for all messages in a sequence, back and forth, but changes for each new sequence; the other is used once only).

They are IMMENSELY helpful when debugging. If message length is REALLY critical then you can take them out for your final version. However, the length of the message is often not the bottleneck, so you can often get away with leaving them in. If you have debug code that can be switched on in the final version, it can make great use of these IDs when a user discovers a problem that involves the networking.

My selection scheme only uses one selector and one thread. I didn’t check against a large number of connections. blahblahblahh’s description seems to be very feasible for scalability.

For each connection I maintain one preallocated ByteBuffer to avoid garbage. Data is streamed into it and roughly parsed as long as there is at least one complete message in it. This can be determined by comparing a transferred length with the limit() of the buffer. When reaching the end of the buffer, I reset the buffer, copy the last unparsed message to its beginning, and continue reading.
When I have a complete message in place, I slice() the buffer and pass it to a system of listeners that actually decode it.
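A rough sketch of that reset-and-copy scheme as I understand it (buffer size, length check and interface names are placeholders):

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.SocketChannel;

// Hypothetical version of the scheme described above: one preallocated buffer
// per connection, parse out every complete length-prefixed message, then copy
// any unparsed tail back to the start of the buffer and keep reading.
class ConnectionBuffer {
    private final ByteBuffer buf = ByteBuffer.allocate(16 * 1024);

    void readAndDispatch(SocketChannel channel, MessageListener listener) throws IOException {
        if (channel.read(buf) == -1) throw new IOException("disconnected");
        buf.flip();
        while (buf.remaining() >= 4) {
            int length = buf.getInt(buf.position());  // peek at the transferred length
            if (buf.remaining() < 4 + length) break;  // message not complete yet
            buf.position(buf.position() + 4);
            ByteBuffer message = buf.slice();         // view onto the message bytes...
            message.limit(length);                    // ...restricted to this one message
            listener.onMessage(message);              // decoded by the listener system
            // (remember the earlier warning: this slice shares storage with buf)
            buf.position(buf.position() + length);
        }
        buf.compact(); // copy the last unparsed fragment to the beginning, continue reading
    }

    interface MessageListener { void onMessage(ByteBuffer message); }
}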

I did stress tests with several clients trying to flood a server with chat messages and the server could handle it properly.

I don’t work with integer IDs in any way. My protocol works with ‘Identity’ objects that can be serialized to/from a wrapped ByteBuffer (NamedIdentity, ByteIdentity, LongIdentity…). But this is more a technical detail. Basically it could be a simple ID.

Each message has to carry at least one Identity object at its beginning to denote the ‘channel’ I am sending on. From there on, the protocol is channel-specific. Since everything in my (high-level) system is identifiable, i.e. has a getIdentity() method, things can be uniquely addressed over the network easily.
But this of course has nothing to do with NIO, only with the architecture of the networking system.

For the case where client and server reside in the same application (common for games), my network system allows messages to be passed directly between them, omitting the network stuff completely. This had some tricky synchronisation issues, but it works now…

[quote]blahblahblahh’s description seems to be very feasible for scalability.
[/quote]
It has one weakness that I see that could manifest itself on more than one CPU servers.

Since blah has a Thread for a bunch of connections and each of those threads works in a serial manner it would be possible for one thread to get backed up while the others are empty, wasting spare CPU cycles. This could lead to some users experiencing slower response times than needed.

The commonly suggested pattern of one Thread working the Selector plus worker threads has one big negative which blah avoids, and which is why I chose not to mention it earlier. In the case where one Thread works the Selector and hands off buffers to worker Threads, you run the risk of race issues. The race condition is that a worker starts processing one ByteBuffer for a client, another buffer comes in, and a different worker starts working on the new ByteBuffer for the same client. If the contents of the second buffer depend on info in the first, which is likely in a stream protocol, then things fall apart. The solution is for the worker Thread to synchronize on an Object unique to each client until that worker is at least done processing the buffer.

So which way is better? On a single-CPU server, blah’s way will probably perform better because of the lack of synchronization blocks. But as you add CPUs you are more and more likely to waste CPU cycles when there is work to be done. Blah can add more threads to increase the number of subdivisions, but he’ll get diminishing returns. I’m gonna guess he needs number of CPUs times 2 to be reasonably sure all CPUs are mostly busy. With synchronization you can get away with a number of worker Threads equal to the number of CPUs.

Now the question is, when do context switches between Threads become more expensive than synchronization overhead? I don’t know the answer. People have thought long and hard about how to optimize switching between threads. Synchronization has been optimized quite a bit too, and is generally a simpler task as I understand it. I think switching between threads is gonna blow out your CPU cache more often, but I’m not convinced you can optimize CPU cache use in Java because it’s so high-level. I definitely think blowing out the CPU cache is worse than synchronization overhead.

Well, now that I’ve followed the thought process all the way to here, I’m not so sure blah’s method is really all that much better even for single-CPU servers, but I’m working on a lot of speculation. Anyone have hard data to prove me wrong, or a strong counter-argument?

EDIT: small changes to try not to claim one way is better than another on a superficial level.

Chuckle. Sounds like you’re in the same position I was until fairly recently - too knowledgeable about how bad threading used to be on certain OS’s. When hardware and OS people now say “don’t use too many threads” they seem to mean “don’t go above a few hundred”.

FYI the design aims were:

  • simple to understand, explain, and reuse
  • great for single-CPU (assuming you tweak the size of each threadpool etc)
  • even better for multi-CPU (assuming you tweak the size of each threadpool etc)

That is a problem it was specifically designed to overcome; the only serial bottlenecks are at the connect and parse stages. I did specifically make the comment about it being OK unless your parse was slow…

Thanks for pointing that out; I’m so used to avoiding complex sync issues that I didn’t think to mention it, but it’s a very important issue. My reasons for avoiding sync, in THIS case:

  • IN THIS CASE it is not trivially obvious where and why the synchronisation is necessary (to a maintainer), introducing some serious risks of accidental corruption by a different coder (or myself, if I’m away from this project for long enough to forget!)

  • synchronization usually has a considerable overhead, especially so on highly-parallelizable code. Not only do you prevent lots of useful compiler optimizations, you also cause significant problems if you have to transfer to a machine with more CPUs.

  • thread switching on modern systems is exceptionally fast, (see below)

Nah; you need to update yourself on the threading capabilities of modern systems. Note:

  • modern consumer CPU’s have excellent built-in support for thread-switching (IIRC IA64 is really good, and even IA32 is very good)
  • modern consumer OS’s have dispensed with all the low-end “don’t use lots of threads or performance suffers” problems. Specialist OS’s are currently coping with THOUSANDS of threads
    o Win2k and above have all been given decent threading (although I can’t confirm/deny how well they cope with extremely large-process-sizes [multi-gigabyte memory per-process])
    o Linux is now pretty good at MT (although I’ve not yet tested this myself)
    o Solaris is still the king of MT performance, apparently.

Assuming I’m handling the IO properly, this should work extremely well with any number of CPU’s. Intended for 1 CPU up to 32 CPUs - so I decided to go for user-tweakable numbers of threads etc.

Given the hardware support for multi-threading, you would be wasting performance (well, maybe not, but certainly wasting thousands of transistors ;) ) if you only had one thread per CPU. I would not recommend it (unless you’re running Windows 95 ;)).

In the nicest possible way, whilst you clearly know what you’re talking about on the specifics, you’re barking up the wrong trees with your analysis. Also, I don’t think you really understand the specific issues of networking performance.

If analysing the performance of heavily loaded servers were anything like simple enough that you could focus on just two hardware issues (in this case thread context switches and CPU caches), then a lot of people would be very very happy and much less stressed. A lot of clever people would also be out of a job.

There’s little to be gained in looking at the hardware for this, because in practice the performance is highly sensitive to everything, and the moment you eliminate a software bottleneck, it’s replaced by a hardware one, or (more frequently) an OS one. And vice versa.

Threading and caches:

It’s common nowadays (has been for some years) to have very many (e.g. > 100) registers in your CPU, and be able to dedicate registers to threads. Context switch cost is minimal - although there are many who wish that all 128 (or even 512 in some cases IIRC) registers were available to the application programmer!

Specifics of networking performance:

Assuming it’s worth looking at the hardware aspect, your talk about CPU caches still surprises me. It’s been a while since I looked at network hardware architectures, so forgive me if I’m unaware of today’s common practice… when I knew about this stuff, networking IO that went via a CPU cache was considered slow.

I thought this was the MAIN POINT of using NIO Buffers in networking - you get the system to do low-level IO routines (especially copies), e.g. bypassing the CPU? It has been known for a long time that if you’re trying to get data into memory, or from one type of memory to another (network cards are just a “special” type of memory, as are hard disks etc), then having a CPU in the way can only make things slower.

Unless I’m greatly mistaken, cpu-cache issues have no impact on the performance of the system I described. But I’d love to be corrected if I’m wrong… (for obvious reasons to do with improving the performance of my systems!)

Geez, blah. Where do you get all the time to type this stuff? Anyway:

[quote] - modern consumer CPU’s have excellent built-in support for thread-switching (IIRC IA64 is really good, and even IA32 is very good)
[/quote]
Actually, this is true for just about every CPU except the IA32. If you read Intel’s docs, the CPU has support for hardware context switching, but it’s so poor that they tell you to do your own context switching in software. Nice design, huh?

[quote] - modern consumer OS’s have dispensed with all the low-end “don’t use lots of threads or performance suffers” problems. Specialist OS’s are currently coping with THOUSANDS of threads
[/quote]
True enough. Most OSes still can’t handle more than 500-1000 threads tho. You could easily service this many using poll/selects spread across only a few threads.

[quote] o Win2k and above have all been given decent threading (although I can’t confirm/deny how well they cope with extremely large-process-sizes [multi-gigabyte memory per-process])
[/quote]
W2K threading isn’t bad. Unfortunately, it’s dragged down by the rest of the system which insists on paging memory whenever possible. I love going to a nice Unix machine and never even touching the swap file. At least W2K/XP don’t swap when you minimize. Development under NT was real interesting. :-/

[quote] o Linux is now pretty good at MT (although I’ve not yet tested this myself)
[/quote]
I’ll punt on this one. Don’t want to anger the Linux crowd…

[quote] o Solaris is still the king of MT performance, apparently.
[/quote]
Right on the money. You can throw threads at Solaris like they’re candy. The system just chugs along with no noticeable performance drop. I don’t know who the guy was that came up with the Solaris thread design, but if I ever meet him, I’d like to buy him dinner. Now if Sun could just come up with consumer Sparc hardware, we could migrate the entire world to Solaris+KDE3! Well, at least we could get the set-top boxes (big market for SPARCs). ;D

[quote]- even better for multi-CPU
[/quote]
As you increase the number of CPUs, the performance increase with your method will diminish at a faster rate than with the “common” NIO algorithm. How much so, and is it a significant amount? I do not know.

[quote]- synchronization usually has a considerable overhead, especially so on highly-parallelizable code. Not only do you prevent lots of useful compiler optimizations, you also cause significant problems if you have to transfer to a machine with more CPUs.
[/quote]
This article claims otherwise. When you consider that the amount of work done in micro-benchmarks is insanely trivial compared to how much work you’d normally do peeling apart a frame, the cost of entry and exit of a synchronization block isn’t considerable; it’s barely measurable. Combine that with the added flexibility to possibly choose an algorithm that more than makes up the difference, and synchronization doesn’t look so bad.

Maybe we can get Jeff to weigh in on this issue. :)

[quote]If analysing the performance of heavily loaded servers were anything like simple enough that you could focus on just two hardware issues (in this case thread context switches and CPU caches), then a lot of people would be very very happy and much less stressed. A lot of clever people would also be out of a job.
[/quote]
I don’t claim that there are only two performance points. They’re just the two points where, as a programmer, you have the ability to influence the program’s performance by adjusting its interaction with the CPU, with respect to the very small chunk of code we are talking about. I don’t make blanket statements about everything in one fell swoop. Blanket statements can almost always be found false [1]. Please don’t continue to think I’m talking about more than I actually am. What I did question, and this supports your argument, is whether it’s even possible to influence CPU cache interaction at the high level that Java operates at.

[quote]There’s little to be gained in looking at the hardware for this, because in practice the performance is highly sensitive to everything, and the moment you eliminate a software bottleneck, it’s replaced by a hardware one, or (more frequently) an OS one. And vice versa.
[/quote]
Bottlenecks are everywhere, no disagreement here. But when discussing a piece of code, hardware that the code doesn’t interact with, or that the software cannot control, is beyond the scope of this discussion.

[quote]It’s common nowadays (has been for some years) to have very many (e.g. > 100) registers in your CPU, and be able to dedicate registers to threads. Context switch cost is minimal
[/quote]
The actual switch is minimal. There is a side effect, though: when you switch context you are at a different point of execution, and the relevant set of memory for each thread will be different. Accessing a different set of memory makes a cache miss more likely, and cache misses are expensive.

[quote]when I knew about this stuff, networking IO that went via a CPU cache was considered slow.

I thought this was the MAIN POINT of using NIO Buffers in networking - you get the system to do low-level IO routines (especially copy’s), e.g. bypassing the CPU?
[/quote]
The cost of the IO is the same for both your algorithm and the “classic” one; they both use NIO in basically the same way. That said, I hope NIO gets the data from the NIC to RAM without involving the CPU.

Either way, the CPU gets involved once you read or manipulate that data. I see little utility in reading data from a network into RAM and then discarding it. Unless your server just forwards data around like a switch, any interesting work will require passing that data through the CPU. Also, since you cannot do pointer math in Java, you have to copy data from the ByteBuffer into variables to do anything interesting.

[quote]Unless I’m greatly mistaken, cpu-cache issues have no impact on the performance of the system I described. But I’d love to be corrected if I’m wrong… (for obvious reasons to do with improving the performance of my systems!)
[/quote]
A cache miss for the CPU is usually an order of magnitude slower than a cache hit. Add that up many times and it’s not trivial. Synchronization overhead tends not to be that many times as expensive. Did you ever have to study Big “O” notation for a programming class? If so, do the math and see where it takes you.

In the end your algorithm may be better than the classic NIO one. I don’t really know without doing a lot of testing. You don’t know either. I’ve just laid out the reasons why I believe you may be wrong.

All things being equal, your algorithm is easier for programmers to not screw up with. That’s probably worth much more than some tiny fraction of a percent performance difference.

[1]: I didn’t mean that as a joke, but it’s kinda funny when you think of it as a blanket statement.

Clue me in if I’m way out to lunch here, but don’t Java synchronization semantics require a cache flush in some cases? Particularly on multi-CPU systems?

This may be something that has long since been dealt with, but I had heard that the Java memory model was a bit too strict in this area and that cache flushing was needed too often to comply with it. That being said I also heard that even Sun’s VM doesn’t comply with the memory model 100% because it is just too darn limiting in this area.

Anyone have up to date details about this stuff?

[quote]Clue me in if I’m way out to lunch here, but don’t Java synchronization semantics require a cache flush in some cases? Particularly on multi-CPU systems?
[/quote]
I may be mixing up my understanding of x86 cache with the IBM PowerPC cache, but I believe that the cache is managed in 4K chunks. So when you acquire a sync lock, even though it may only be updating a few bytes in RAM, a whole 4K will be pushed across the bus. Yeah, that is wasteful, but it’s far from being the whole CPU cache.

[quote]I had heard that the Java memory model was a bit too strict in this area and that cache flushing was needed too often to comply with it.
[/quote]
My understanding of the problem is that in a non-synchronized code block the Java Language Specification (JLS) allows a reference to an Object to become visible before the Object has been initialized. In a threaded environment it would be possible for a read of a String object to return “foo”, or whatever happened to be in that memory already, one moment, and then, once the String is done initializing, read “bar” the next moment. This could be bad if the String represents a file name to be deleted and the first read was a security check.
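A tiny illustration of the kind of hazard being described (names invented; whether a given VM actually exhibits it is another matter):

// Hypothetical illustration of unsafe publication under the old memory model:
// without synchronization, the reader thread may see a non-null reference to
// a Holder whose fields have not yet been fully initialized.
class UnsafePublication {
    static class Holder {
        final String fileName; // 'final' only helps under the revised memory model
        Holder(String fileName) { this.fileName = fileName; }
    }

    static Holder shared; // not volatile, writes not synchronized

    static void writer() {
        shared = new Holder("bar.txt"); // reference may become visible before the field write
    }

    static void reader() {
        Holder h = shared;
        if (h != null) {
            // Under the old JMM this could, in principle, observe a stale or
            // default value rather than "bar.txt".
            System.out.println(h.fileName);
        }
    }
}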

[quote]Anyone have up to date details about this stuff?
[/quote]
I think this is the best source: http://www.cs.umd.edu/~pugh/java/memoryModel/

Thanks for the article; interesting that people still believe sync was 50 times (!) slower than non-sync - I only worry about it being 1.5 - 2 times slower, up to perhaps 3 times slower in worst-case-scenarios. I was going by the theoretical side-effects of synchronization, simply because the actual performance of it seems to keep changing from release to release - and I can’t keep up :).

I know of platform-dependent issues w.r.t. synchronization which muddy the waters even more. Depending on the OS (and the JVM vendor!) certain approaches to implementing sync can cause lots of accidental performance problems, because they interfere with the OS’s behaviour and/or assumptions.

Basically, until Sun disseminates a lot more information on sync implementations - “Facts and Figures: no benchmarks” - I don’t trust it in a highly-parallelized environment. The issues I mentioned originally are theoretical, and hence unavoidable. It could be that the latest hotspot compilers use new compiler optimizations that give sync’d code practically the same performance as non-sync’d code - but I do know that you would need NEW optimizations; there are plenty of standard ones that get screwed by sync’s.

Sorry if my response came across badly… I didn’t fully explain. The key point here is to use the classic approach to optimization - look at each possible optimization, work out the saving on each, and then apply a weighting to each saving, where the weighting is proportional to how often that saving will actually be realised in practice. AFAIAA, the optimizations you were talking about are rarely the bottleneck in a networking server, and so the optimizations there become weighted to almost nothing.

However, it varies so much from system to system that they could well be the problem on some systems - hence you have to look at each system separately, work out the amount of bottlenecking at each point in the system and THEN start looking at optimizations by area. It’s a little like optimizing an app, except that it’s much much harder to profile, and much much harder to predict bottlenecks :(.

:) I’ve been hoping that NIO might herald a new direction for Java - towards more voluntary use of low-level things. In particular, perhaps one day we might see cache-hinting? Although, realistically, I suspect that Hotspot etc are taking the execution stage so far away from the raw byte-code (i.e. it’s very very hard to predict what your compiled byte code will look like by the time it gets executed) that it would be hard to achieve.

PS Wonder if the “register” keyword will ever get a meaning? :) …and all the other unused reserved words…

Yes, but…Not if the hardware has a bottleneck so huge that it completely dwarfs any savings you may make. E.g. if there is hardware you cannot control that is soaking up 95% of the performance at some point, even if you make the 5% 100 times faster, you only improve performance by approx 5%.

(just to be clear, I’m not claiming 95% is typical, it’s just an extreme example. However, large percentages are certainly possible)

True. AFAIAA this same problem is the main reason why Xeon chips have so much RAM - oops! I mean cache :). Because the bus is shared, the chips in an MP system have to pre-cache as much RAM as possible so that they can survive until their next chance at the CPU-RAM bus without getting blocked.

However, modern L2 caches generally are large enough to satisfy quite a lot of threads - this is actually a really good reason for having a large L2.

Yeah, but when you’re writing a server app you don’t usually care about what happens once the data is in memory. Your network card is so incredibly slow that you want to do everything you can to avoid waiting for it. I’m in a situation where my worker-threads sometimes do a lot of work for a single request, so I do actually have to be careful about CPU performance, and attempt to increase parallelism…but the cost of a context switch (including associated knock-on-effects) is vanishingly small compared to what else I’m doing in the worker thread…

If I’d found a classic NIO algo, I would have used it. All I’ve seen so far are tutorials that quote the same tutorial code, and Sun’s examples, which are crud, hardly using NIO at all. Well, they call the APIs. But they don’t take real advantage of the new features. I find this VERY frustrating.

So, this algo is an attempt to bring some of the wisdom of high-performance unix server designers to java; I’ve had to modify classic approaches a little to fit the fact that e.g. java’s select() is not quite identical. I’m pleased at the chance to open it up for criticism on JGO because, as Herkules remarked, it’s rather hard to be sure you’re really using NIO well - there are few experts on the java impl around at the moment!

I’m writing a chapter for a game-development book on high-performance servers at the moment, and would like to provide an example using java NIO; but only if I can get a really good one together (there’s plenty of poor examples around already, I don’t need to add to the pile ;)). For this reason, although my immediate concern is a good algo for the project I’m using it on, I also have a long-term interest of making it REALLY good. I’ll also back-port any corrections to the other project where I’m using a similar approach.

It’s very nice to hear that :). It’s one of the critical requirements of this project :). And although the book will be read by lots of hard-core game programmers, it will also be read by relative newbies, so if I want to reuse any code, it’ll have to be non error-prone :).

[quote]I’m writing a chapter for a game-development book on high-performance servers at the moment
[/quote]
Best of luck with that.

Below is a half-rant that could be misinformed, but I believe it’s more informed than most rants about synchronization. It’s also a chance for you to poke holes in what I think would be optimal non-blocking NIO processing.

Synchronization is either part of a method signature or implemented with the monitorenter and monitorexit bytecodes. Either way, what is not free is the transition across a synchronization boundary. Once you’re inside or outside, there is no performance difference.

People sometimes assert that synchronization prevents the JVM from doing optimizations, and that is true when the optimization would have been across a synchronization boundary. For example, a JavaBean getter method is often just a return of a private field and is commonly inlined. Making a getter synchronized is relatively expensive because you cross a boundary, read one value, then cross back across that boundary. That boundary complicates optimization. If an optimization happens entirely inside a synchronization boundary, then it is no different from any other. (This sort of conflicts with the advice of keeping synchronization blocks as small as possible. I think that advice stems from the fact that it is often very hard to determine how long something will block because of contention.)
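A trivial sketch of that getter point (invented class, just to show where the boundary sits):

// The plain getter can typically be inlined away entirely; the synchronized
// one forces a monitorenter/monitorexit around a single field read.
class Bean {
    private int value;

    int getValue() {                    // cheap: commonly inlined by the JIT
        return value;
    }

    synchronized int getValueSafely() { // crosses a synchronization boundary both ways
        return value;
    }

    synchronized void expensiveWork() {
        // By contrast, plenty of optimization can still happen entirely inside
        // one boundary: everything in here is ordinary code to the JIT.
        for (int i = 0; i < 1000; i++) {
            value += i;
        }
    }
}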

What I consider a more intelligent use of synchronization with respect to NIO is to synchronize on the Object attached to a SelectionKey. In the thread that watches the Selector, have something like:

// This method is only called by the Thread that works the Selector.

public void processSelectionKey(SelectionKey key) {

  // 'pool' is a hypothetical worker-pool field that hands out an idle ConsumerThread.
  ConsumerThread pooledThread = pool.acquire();
  Channel channel = key.channel();
  ChannelConsumer consumer = (ChannelConsumer) key.attachment();

  pooledThread.consume(consumer, channel);
}

Then hand off to the worker ConsumerThread and immediately synchronize on the consumer and start running it:


class ConsumerThread extends Thread {

  private ChannelConsumer consumer; // Something specific to your project that reads from a Channel
  private Channel channel;          // A NIO Channel
  private volatile boolean done = false;

  // Called by the Selector thread: hand a unit of work to this pooled thread.
  public synchronized void consume(ChannelConsumer consumer, Channel channel) {
    try {
      // If the previous piece of work hasn't been picked up yet, wait for it.
      while (this.consumer != null || this.channel != null) wait();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      return;
    }

    this.consumer = consumer;
    this.channel = channel;

    notify(); // Wake this thread up
  }

  public void run() {
    try {
      while (!done) {
        Channel channel;          // A NIO Channel
        ChannelConsumer consumer; // Something specific to your project that reads from a Channel

        synchronized (this) {
          while (this.consumer == null || this.channel == null) wait();

          consumer = this.consumer;
          this.consumer = null;

          channel = this.channel;
          this.channel = null;

          notify(); // Wake the consume() method up if needed

        } // synchronized (this)

        synchronized (consumer) {
          consumer.consume(channel);
        } // synchronized (consumer)

        // OK, we're done; release this ConsumerThread back to the thread pool.
      } // while (!done)
    } catch (InterruptedException e) {
      // Interrupted while waiting for work: let the thread exit.
    }
  }
}

So what do we have? Three points of synchronization. The first synchronization point could block the Thread that works the Selector, but if you look at the second synchronization point you’ll see that the window for contention is very small. The third point of synchronization is to prevent a second entry into the consumer at the same time. Once you get past the third synchronization point, the code flow is the same as in blah’s design.

So if my assertion holds up that keeping the number of threads close to the number of CPUs is best for performance, then the increased amount of work done will outweigh the cost of the synchronization points.

Also, I’m still convinced that blah’s Selector-per-thread model will create more latency processing the queue of ready Channels, because if it happens that a bunch of activity is all being worked by the same Thread, then the other CPUs will waste cycles. The model above automatically balances the work load so it gets to each worker Thread evenly.

Blah could detect when a worker thread gets backed up somehow and then unregister a Channel from one Selector and move it to a less busy one, but I think that would be hard to do without creating a window in which notification of new data could get lost.

Now, it is still possible for there to be a big block from contention on my third synchronization point. This is less likely with TCP because reads and writes tend to be coalesced, since it’s a stream. In other words, the probability of two updates on the same connection arriving back to back is slim.

UDP is different: reads/writes are atomic and you are more likely to have back-to-back reads on the same connection. My first guess is to recommend having the number of worker threads equal to the number of CPUs plus one. That guess is based on the advice to parallelize gnu make to the number of CPUs plus one when compiling large C/C++ programs, because it does better at keeping all CPUs busy while the newest compile process is waiting for disk I/O to complete.

EDIT: Made the code samples less ambiguous.