Should you risk using NIO for hard-core networking

Which select method have you guys used? select() vs. select(long) vs. selectNow()? Have you noticed any implementation issues between them? From my testing I have found that select() often blocks too long, so I switched to doing selectNow() followed by a Thread.sleep() to rest a bit and saw a significant improvement in connection throughput. Anybody else seen something similar?

Ah. You are not quite correct in thinking that there is no performance alteration once you are inside a sync’d block.

Once a thread is inside a sync block, various things happen. One of these is that it now cannot be pre-empted by any other thread that syncs on the same key. In a multi-threaded environment, that can potentially be the kiss of death - one way of looking at it is that you are in fact moving from a windows NT/2k/XP scheduler to a windows-3.1 scheduler (for those threads only). For the threads that sync-conflict, you deliberately disable the advantages of intelligent scheduling, and force your current thread to hog the system until it completes its block. This is a really, really good reason to use sync’ing sparingly!

Thus, especially in a highly MT environment, there is a performance loss IN SOME WAY proportional to the length of time you spend inside sync blocks. The proportion is governed mainly by how often other threads WOULD have been scheduled to have CPU time instead, but other factors can come into play; if the sync implementation is good, I suspect that external factors should have little effect, but I’m afraid I really don’t know - back to that point of not being able to keep up with the state of the art in current VMs.

The best thing to do to avoid this problem is have synchronisation on as many different keys as possible - thereby reducing the frequency with which you get contention to use any particular sync-block, which hamstrings the scheduler. You should also spend as little time as possible inside sync’d blocks.
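To make the “many different keys” advice concrete, here is a minimal sketch of lock striping (my own illustration, not code from this thread; the class name, stripe count, and workload are all invented): one logical structure guarded by several monitors, so threads working on different keys rarely contend for the same lock.

```java
// Sketch: stripe one shared structure across several locks instead of
// guarding everything with a single global monitor.
public class LockStriped {
    private static final int NUM_STRIPES = 16;  // arbitrary choice
    private final Object[] locks = new Object[NUM_STRIPES];
    private final int[] counters = new int[NUM_STRIPES];

    public LockStriped() {
        for (int i = 0; i < NUM_STRIPES; i++) {
            locks[i] = new Object();
        }
    }

    // Threads touching different keys usually sync on different monitors,
    // so they rarely block each other.
    public void increment(int key) {
        int stripe = (key & 0x7fffffff) % NUM_STRIPES;
        synchronized (locks[stripe]) {
            counters[stripe]++;
        }
    }

    // Reads every stripe under its own lock and sums the counts.
    public int total() {
        int sum = 0;
        for (int i = 0; i < NUM_STRIPES; i++) {
            synchronized (locks[i]) {
                sum += counters[i];
            }
        }
        return sum;
    }
}
```

The trade-off is that any operation spanning all stripes (like total() here) must take every lock, so this only pays off when most operations touch a single key.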

Depending upon what you actually DO inside the sync’d block, you could also temporarily screw your whole OS’s performance - if you did some blocking IO, for instance, which can easily hold the lock for the equivalent of many thousands of operations, you rapidly waste CPU cycles in every thread waiting on that lock. This particular situation CAN affect all threads in the system. In a perfect world, non-java threads would NOT be affected. However, if a JVM uses the underlying OS’s sync primitives - or the hardware sync primitives - then it is feasible.

One of the things that recent JVMs have been doing to improve sync performance MIGHT be that they implement their own sync, instead of using the OS, but still using the hardware primitives. Shrug. Maybe.

Some more info on scheduling of threads (pretty simplistic, and sadly not authoritative on the details - prefers to leave them out, probably wisely!):
http://www.javaworld.com/javaworld/jw-07-2002/jw-0703-java101.html?

Whilst trying to find some facts and articles, I found an interesting quote by one of the inventors of the Monitor for concurrent programming (which java’s sync is similar to, but not identical to):


"It is astounding to me that Java's insecure parallelism is taken seriously by the programming language community a quarter of a century after the invention of monitors and Concurrent Pascal. It has no merit".

(found in Sun’s “High performance Java” book, chapter 4).

When you say UDP does “atomic” read/writes, what do you mean? Do you mean you cannot do non-blocking UDP with nio? A while ago I asked Sun whether you could or not, and got no reply; the API docs don’t say either way.

PS I’m hoping to have spare time (but not get any for a while :frowning: ) to look at your sync-based code carefully, but I have great problems with reading java code that uses wait() and notify() - they are appallingly named methods, from a code-reading perspective. I’ll need plenty of time and coffee to make sure I’ve read it properly :)…

[quote]Ah. You are not quite correct in thinking that there is no performance alteration once you are inside a sync’d block.

Once a thread is inside a sync block, various things happen. One of these is that it now cannot be pre-empted by any other thread that syncs on the same key. In a multi-threaded environment, that can potentially be the kiss of death - one way of looking at it is that you are in fact moving from a windows NT/2k/XP scheduler to a windows-3.1 scheduler (for those threads only). For the threads that sync-conflict, you deliberately disable the advantages of intelligent scheduling, and force your current thread to hog the system until it completes its block. This is a really, really good reason to use sync’ing sparingly!
[/quote]
Yea, this is called contention. It’s when you try to sync on an Object but the lock is already taken. The Thread that failed to acquire the lock goes to sleep until the lock has been freed, which is the desired behavior. Saying “[it] cannot be pre-empted by any other thread that syncs on the same key” makes it sound like a horrible thing. It’s not. I’m pretty sure it’s the same effect as any other language’s locking mechanism.

In my code there are three possible places for contention, but I think it is rather rare that any of them will happen.

1: If a ConsumerThread instance has been put back to work before it has made it from the end of the while (!done) loop and back to the while (…) wait() loop you could have some contention. The Thread that works the Selector would have to be in the ConsumerThread.consume(ChannelConsumer, Channel) method after the if (…) wait() line in that method. It’s a very small window. So small I’ve pondered trying to not synchronize it, but I know it’s Not the Right Thing To Do™ and can lead to subtle bugs.

2: The opposite of above. If the ConsumerThread is in between the while (…) wait() line and the end of the synchronized (this) block, the Selector worker thread could have to contend for the lock when it calls ConsumerThread.consume(ChannelConsumer, Channel). Again, a very small window.

3: There could be contention at synchronized (consumer) in the ConsumerThread.run() method if the time it takes to process consumer.consume(channel) is longer than it takes for the next set of data to come in on the same channel and get dispatched to a different ConsumerThread. While this one is possible, it gets less likely the more clients you have connected to your server and the faster you return from the consumer.consume(channel) method.
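For readers without the source to hand, here is a hypothetical reconstruction of the hand-off pattern being described - the real ConsumerThread surely differs, and all names here are my guesses. The worker parks in wait() until the selector thread hands it work via consume(); the small contention windows described above correspond to the two synchronized regions.

```java
// Sketch of a selector-thread -> worker-thread hand-off using
// wait()/notify(). Single producer, single worker per instance.
public class Worker extends Thread {
    private Runnable task;          // guarded by this
    private volatile boolean done;  // set to stop the thread

    // Called by the selector thread: hand the worker a task.
    public synchronized void consume(Runnable t) {
        while (task != null && !done) {  // window #2 above: worker still busy
            try { wait(); } catch (InterruptedException e) { return; }
        }
        task = t;
        notify();  // wake the worker
    }

    @Override public void run() {
        while (!done) {
            Runnable t;
            synchronized (this) {
                while (task == null && !done) {  // window #1 above: parked
                    try { wait(); } catch (InterruptedException e) { return; }
                }
                t = task;
            }
            if (t != null) t.run();   // do the work OUTSIDE the lock
            synchronized (this) {
                task = null;
                notify();  // tell a blocked consume() the slot is free
            }
        }
    }

    public void shutdown() {
        done = true;
        synchronized (this) { notify(); }
    }
}
```

Because only one thread can ever be in wait() on this monitor at a time (the producer waits only when the slot is full, the worker only when it is empty), plain notify() is safe here; with more threads involved you would want notifyAll(), as the post below notes.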

[quote]When you say UDP does “atomic” read/writes, what do you mean?
[/quote]
I mean each UDP packet results in a unique read. It’s what I say in the 3rd post in:
http://www.java-gaming.org/cgi-bin/JGOForums/YaBB.cgi?board=Networking;action=display;num=1045163896

This is different from TCP, which is a stream of bytes; it just happens that we tend to interact with it in byte[] chunks at a time. Since TCP is a stream, NIO (and your operating system) is free to optimize the transfer by combining smaller chunks into larger chunks to reduce the number of separate packets or separate reads needed to move the data around.

I don’t know if the above happens in NIO; as far as I know it isn’t specified and would be an implementation detail. One possible example: a bunch of channels have data ready to be read. Before your code gets around to reading the data for a channel, more data comes in. Does NIO append the new data to the end of the existing to-be-read data? I think it would be more efficient if it did, because then we wouldn’t have to go through the whole selection process again to get data that is already local.
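One practical consequence of TCP being a stream, worth a sketch (my own example, not from either poster): because a read() can return any byte boundary, application code normally adds its own framing - e.g. a length prefix reassembled across partial reads. All names here are invented.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

// Sketch: length-prefixed message framing over a TCP byte stream,
// tolerant of reads that deliver partial messages.
public class Framer {
    private final ByteBuffer buf = ByteBuffer.allocate(64 * 1024);

    // Feed whatever bytes the last read() returned; get back any
    // complete messages, leaving partial ones buffered for next time.
    public List<byte[]> feed(byte[] chunk) {
        buf.put(chunk);
        buf.flip();
        List<byte[]> out = new ArrayList<>();
        while (buf.remaining() >= 4) {          // enough for a length prefix?
            buf.mark();
            int len = buf.getInt();
            if (buf.remaining() < len) {        // message not complete yet
                buf.reset();
                break;
            }
            byte[] msg = new byte[len];
            buf.get(msg);
            out.add(msg);
        }
        buf.compact();                          // keep any leftover bytes
        return out;
    }
}
```

With UDP none of this is needed: each receive() hands you exactly one datagram, which is what “atomic reads” buys you.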

BTW: that Java NIO book has had the answer to all the NIO questions I’ve had so far. While the author’s writing style isn’t to my tastes, I cannot complain about the quality of the content.

[quote]Do you mean you cannot do non-blocking UDP with nio?
[/quote]
I’m currently doing non-blocking DatagramChannel I/O in a pet project of mine.
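For the record, the setup is straightforward. Here is a self-contained sketch (my own, not from that project; the buffer size and loopback addressing are arbitrary choices) that pushes one datagram through a selector-driven non-blocking receiver:

```java
import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.DatagramChannel;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;

// Sketch: non-blocking UDP with NIO. A non-blocking DatagramChannel is
// registered with a Selector for OP_READ; a second channel sends to it.
public class UdpSketch {
    public static byte[] sendReceive(byte[] payload) throws Exception {
        DatagramChannel server = DatagramChannel.open();
        server.socket().bind(new InetSocketAddress("127.0.0.1", 0));
        server.configureBlocking(false);        // the non-blocking part

        DatagramChannel client = DatagramChannel.open();
        client.configureBlocking(false);
        client.connect(server.socket().getLocalSocketAddress());

        Selector sel = Selector.open();
        server.register(sel, SelectionKey.OP_READ);

        client.write(ByteBuffer.wrap(payload)); // one datagram out

        sel.select(2000);                       // wait until it is readable
        ByteBuffer in = ByteBuffer.allocate(1500);
        server.receive(in);                     // one packet, one read
        in.flip();
        byte[] got = new byte[in.remaining()];
        in.get(got);

        sel.close();
        client.close();
        server.close();
        return got;
    }
}
```

Note the “atomic” behaviour discussed above: receive() returns exactly the one datagram, never half of it or two glued together.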

[quote]PS I’m hoping to have spare time (but not get any for a while :frowning: ) to look at your sync-based code carefully, but I have great problems with reading java code that uses wait() and notify()
[/quote]
I’ll admit I’m not the strongest programmer when it comes to using those methods, but I’m reasonably sure I’m using them right in this rather simple case. I am assuming that there are no external threads interacting with the mentioned code fragments. If this were not the case, I believe it would be prudent to use notifyAll() instead.

As if this wasn’t long enough already, here is one more quote from the Java Performance Tuning, 2nd Edition book from O’Reilly. In the Synchronization Overhead section he says a lot of bad things about the performance characteristics of synchronization. BUT at the end of that section, on page 291, he does say:

[quote]The 1.4 server-mode test is the only VM that shows negligible overhead from synchronized methods. […] On the other hand, I shouldn’t underplay the fact that the latest 1.3 and 1.4 VMs do very well in minimizing the synchronization overhead (especially the 1.4 server mode), so much so that synchronization overhead should not be an issue for most applications.
[/quote]
(The emphasis is mine.)

I feel like a broken record but I’ll try to sum it all up one more time. If I’m wrong you can come over to my house and beat me up.

Synchronization without contention is cheap - not free, but cheap enough that the flexibility of being able to pick a better algorithm may lead to a net performance gain. Synchronization with contention is expensive, because it effectively serializes what would otherwise be parallel tasks. Understanding the probable contention characteristics is the only real way to make an educated choice.
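A crude harness for eyeballing that claim (my own sketch, not a proper benchmark - the numbers it produces are indicative at best): the same increment workload run once with every thread pounding a single shared lock (contended), and once with one lock per thread (uncontended).

```java
// Sketch: time the same workload under contended vs uncontended locking.
public class SyncCost {
    // Returns elapsed nanoseconds for `threads` threads each doing
    // `iters` locked increments. If `shared` is true they all fight
    // over one monitor; otherwise each thread has its own.
    static long run(int threads, boolean shared, int iters)
            throws InterruptedException {
        final Object sharedLock = new Object();
        final long[] counts = new long[threads];
        Thread[] ts = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            final int id = i;
            final Object lock = shared ? sharedLock : new Object();
            ts[i] = new Thread(() -> {
                for (int j = 0; j < iters; j++) {
                    synchronized (lock) {
                        counts[id]++;
                    }
                }
            });
        }
        long t0 = System.nanoTime();
        for (Thread t : ts) t.start();
        for (Thread t : ts) t.join();
        long elapsed = System.nanoTime() - t0;
        long total = 0;
        for (long c : counts) total += c;
        if (total != (long) threads * iters) {
            throw new AssertionError("lost updates: " + total);
        }
        return elapsed;
    }
}
```

The contended run is usually slower, but by how much depends entirely on the VM and scheduler - which is exactly the “understand your contention characteristics” point.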

(slight digression, but relevant…) Having finally had time to read the theses and papers regarding SEDA, I seem to have been heading towards the same solution as them (a Staged Event Driven Architecture - although I’d prefer to include the word Message in there, if I were naming it, given its heritage).

I did actually do a design for precisely such a system, but haven’t implemented it yet; my current work (as described in this thread) is a cut-down version; the big question is how many of the benefits of SEDA-like systems my cut-down version achieves :). If the answer turns out to be none/few, I’ll probably start using the SEDA libraries :).

http://www.eecs.harvard.edu/~mdw/proj/seda/index.html

I suggest reading the “SOSP’01 paper on SEDA” (link on that page, near the bottom of the main text) to start off with…

Alright, I’ve used NIO in anger a few times for commercial projects, and each time I’ve ended up with a slightly different architecture.

So, this time round, I decided I’d start documenting each of the different architectures I’ve tried, so that next time I can just do a pick-n-mix job. Hopefully I can also build up a library of do’s and don’ts for each architecture. I’m trying to cover different strategies for using Buffers, and also strategies for network-server pipelines, and any and all combinations of those two.

Is anyone interested in seeing/sharing this stuff? Some of the responses on this thread suggest to me yes. If so, I’ll put up a webpage, and volunteer to collate, edit, re-format and maintain architectures from different people, with comparative pros, cons, feedback, etc.

I’m only suggesting this because I’m fed up waiting for sun to produce good quality network NIO docs, and can’t really afford to wait any longer :). If there’s a site out there that already does this, PLEASE let me know ! :smiley:

[quote]Which select method have you guys used? select() vs. select(long) vs. selectNow()? Have you noticed any implementation issues between them? From my testing I have found that select() often blocks too long and so switched to doing selectNow() followed by a thread.sleep() to rest a bit and saw significant connection throughput. Anybody else seen something familiar?
[/quote]
Only just got around to looking at this…

I would be really, really surprised if select blocks “too long”… I often use it with no problems at all (instant response), but that’s because I’m fond of high performance, and select(long) is always doomed to be a low-performance method for the vast majority of apps (it forces you into a poll-based approach; select() forces you into an event-driven approach - cf. the SEDA reference I posted for info on the advantages of event-driven; it’s the ED in SEDA ;D).

What you describe, using selectNow combined with a sleep, is obviously a poll :slight_smile: yeuck.

If I were to hazard a guess, I’d say you’re having a problem I had in a much more extreme version - so extreme, it blocked my first NIO app indefinitely. This was bad, so I dug further, and noticed that some (note: not ALL! This is why you don’t necessarily find it!) of the NIO methods have notes about how they block on synchronized method calls in seemingly unrelated classes (sometimes the notes are pretty vague). Several methods block on not just one but two (or even three, IIRC?) other methods.

I.e., if you call method a() and then call b() or c(), they will block until a() returns. Usually, a() is in a different class, and b() has two versions, e.g. b() and b( long blah ). In some places, ONLY the documentation for the version with the most arguments actually mentions the blocking behaviour. The other versions all tell you to read that version IIRC, so you are pointed in the right direction…

Look carefully at ALL the jdocs for the following methods, and then look at your source:

  • register
  • select
  • interestOps (yes, really! And there’s some pretty shitty hand-waving going on in this one…)
  • keys (no comment on the method - you HAVE to read the class comment. This is really bad documenting style :()
  • selectedKeys (no comment on the method - you HAVE to read the class comment. This is really bad documenting style :()
  • readyOps
  • selector
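Those blocking interactions are why many NIO event loops never call register() from outside the selector thread at all: they queue the registration request and wakeup() the selector, so the register happens on the selector thread between selects. A sketch of that pattern (the class and method names are mine, not from any official docs):

```java
import java.nio.channels.SelectableChannel;
import java.nio.channels.Selector;
import java.util.ArrayList;
import java.util.List;

// Sketch: avoid register()-vs-select() blocking by queueing
// registrations and performing them on the selector thread.
public class SelectorLoop {
    private final Selector selector;
    private final List<Object[]> pending = new ArrayList<>();

    public SelectorLoop() throws Exception {
        selector = Selector.open();
    }

    // Called from any thread: never touches the key set directly,
    // so it cannot block against a select() already in progress.
    public synchronized void requestRegister(SelectableChannel ch, int ops) {
        pending.add(new Object[] { ch, Integer.valueOf(ops) });
        selector.wakeup();  // kick select() so it sees the request soon
    }

    // Run on the selector thread; drains pending registrations, then
    // selects. Returns the number of ready keys.
    public int selectOnce(long timeout) throws Exception {
        synchronized (this) {
            for (Object[] p : pending) {
                ((SelectableChannel) p[0]).register(selector, (Integer) p[1]);
            }
            pending.clear();
        }
        return selector.select(timeout);
    }

    public Selector selector() {
        return selector;
    }
}
```

The same queue-and-wakeup trick works for interestOps changes, which have the same blocking caveat.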

Yes. I would be VERY interested in seeing this, blahblahblah.

Having not found the docs I need to be able to produce good networking code, I can’t tell you how much this would assist me.

[quote]Yes. I would be VERY interested in seeing this, blahblahblah.

Having not found the docs I need to be able to produce good networking code, I can’t tell you how much this would assist me.
[/quote]
OK, I’ve put a poor initial version up at:

http://grexengine.com/sections/people/adam

I haven’t had time to collate all my notes - or even to document anything more than the most recent approach I’ve used - but I’ve checked it over and it doesn’t say anything that is actually WRONG at the moment (AFAIAA).

This could be a significant improvement over many of the existing resources on writing network code in java :), many of which are “ill-informed”, to be polite ;).

Actually, I might also include a pruned version of the TCP/UDP debate I waded into last month, because a lot of people seem to be reading outright lies about TCP/UDP - and end up doing foolish things through no fault of their own. If I cut and pasted, and reordered and reformatted (for clarity) the “best bits” of that thread, do you think it would be worth adding too? I have no personal need for a TCP/UDP reference (this stuff hasn’t changed from the pre-java days) but perhaps other people might appreciate it?

Just a quick glance through indicates to me that this is the kind of stuff we REALLY need to be able to post on here, as part of the “WEBSITE” and not the “FORUMS.”

Yes, I think we need to somehow get together and start seriously thinking about getting up GOOD articles on how to program games, including all aspects such as networking.

I’ll respond with more info after I have read your article in detail, and I may fire off an email to Chris about what the scoop is on the website itself.

[quote]Just a quick glance through indicates to me that this is the kind of stuff we REALLY need to be able to post on here, as part of the “WEBSITE” and not the “FORUMS.”
[/quote]
Ok, this will be the one time I say this for this thread. Give me a Wiki. Then we are self-publishing and it’s easy to evolve documents. It’s not perfect but it’s a good fit.

Leknor, is there anything that prevents us from making our own Wiki? If not, why don’t we just start one?

is there anything that prevents us from making our own Wiki?

I won’t put words in Leknor’s mouth, but Wiki software is pretty much freely available, just like forum software. The only thing preventing somebody from hosting one is the physical server and bandwidth. Do you have a T1 you’d like to donate? :slight_smile:

God bless,
-Toby Reyelts

T1? No. Server? Quite possibly. I’ll start researching. If this happens I will be hosting it on www.equinoxesolutions.com which is my corporate site, but doesn’t see THAT much traffic. It’s currently sitting on a DS3 for now, with a redundant DS3 going in in July, from what my provider tells me.

[quote]Leknor, is there anything that prevents us from making our own Wiki? If not, why don’t we just start one?
[/quote]
Nope, I just don’t want anyone to feel like I’m trying to hijack a community or spam one community about another. I like JGO and want to add to it, not spread it out.

If people want, I’ll create jgo.leknor.com and put TWiki on it. It’s Perl-based like YaBB, which should help future JGO integration, and is good enough for IntelliJ. There are a lot of Wikis out there and it’s hard to know which ones are worth a damn.

RE-EDIT:

I’m installing TikiWiki, a PHP-based Wiki, since I prefer PHP to Perl any day of the week. It’s quite powerful, GNU LGPL, and looks to serve our needs.

More info in a different topic more appropriate to this soon.

Further Wiki discussion in the “General Announcements” category.

This sounds good; but wikis might not be good as the mechanism.

I’ve used wikis a fair bit before, and they are a great tool for lots of situations, but they have significant problems with producing coherent and/or authoritative literature. The biggest problem is that any document/webpage that seeks to help people by informing them and comparing and contrasting different approaches MUST be moderated, and this pushes it away from what wikis are best at.

I’ve participated in using wikis to compose documents before, where one author (or maybe a couple of them) then created a new document based on everything in the wiki. This works well.

However, with something like this, where accuracy is critical, and it’s really easy to make mistakes, I think it’s important that we have everything confirmed/verified by someone else before telling people about it.

…how about using a wiki for a “suggested architecture/pattern/etc” area, with each submission including a test case? Then, as each test case + idea etc. gets verified, they can be transferred into a moderated document?

So, if you need accurate info, you read the latest version of the doc. If you want to see what new ideas people are kicking about, and feedback into them, and/or fix bugs in their source/testcase, you go to the “submissions” area.

Shrug. Just throwing out ideas…

Why did this thread die?

I essentially pared down my sample code so that there is only 1 thread with one selector for all 3 operations: OP_CONNECT, OP_READ, OP_WRITE. I know this is not the best approach, but I wanted to remove any chance of having MT issues. The following code was adapted from a Sun developer example and modified to keep track of how long each operation takes.

The general flow is the following:

  1. The Server starts up and listens for connecting sockets.
  2. Once the predetermined number of connections have connected to the server, the server then sends 1 byte to all the connections telling them to send their payload. This is the initialization phase. After sending the bytes, each connection’s selection key interest ops is set to read.
  3. The clients, upon reading this one byte from the server, then switch their interest ops to write, and send their payload to the server. This is the client’s write phase. The clients then switch interest ops back to read to be ready for the response from the server.
  4. The server then reads the payload from all the clients. This is the server’s read phase.
  5. When all bytes for a client connection have been read, the server immediately switches that connection’s selection key interest ops to write and writes back all the bytes received (a simple echo). This is the server’s write phase.
  6. The client then reads all the bytes sent back from the server. This is the client’s read phase.
  7. On the server, once all bytes have been read for all client connections, the server repeats the initialization phase (#2 above) all over again, up to the pre-determined number of trips.
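The interest-op flipping in steps 2-7 can be modeled as a tiny per-connection state machine. This is my own sketch of the shape implied by the description, not the poster’s actual code; the channel I/O is abstracted away so the phase logic stands alone.

```java
import java.nio.ByteBuffer;

// Sketch: per-connection state for the echo protocol above. The server
// reads until the payload is complete, then flips to writing it back.
public class EchoSession {
    public enum Phase { READING, WRITING, DONE }

    private final ByteBuffer payload;
    private Phase phase = Phase.READING;

    public EchoSession(int expectedBytes) {
        payload = ByteBuffer.allocate(expectedBytes);
    }

    // Feed bytes as read() delivers them (possibly in fragments).
    // When the payload is complete, this is where the real code would
    // call key.interestOps(SelectionKey.OP_WRITE) - step 5 above.
    public Phase onRead(byte[] chunk) {
        payload.put(chunk);
        if (!payload.hasRemaining()) {
            payload.flip();          // switch from filling to draining
            phase = Phase.WRITING;
        }
        return phase;
    }

    // Drain bytes as write() accepts them; write() may take fewer
    // bytes than offered, hence the maxBytes parameter.
    public byte[] onWritable(int maxBytes) {
        int n = Math.min(maxBytes, payload.remaining());
        byte[] out = new byte[n];
        payload.get(out);
        if (!payload.hasRemaining()) {
            phase = Phase.DONE;      // trip complete for this connection
        }
        return out;
    }

    public Phase phase() {
        return phase;
    }
}
```

Keeping this logic separate from the Selector plumbing also makes it much easier to instrument with the per-phase timers described next.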

Each phase keeps track of how long it takes to perform its operation based on System.currentTimeMillis() taken before and after each operation. Then after each trip is completed, a print out is done summarizing the results.

The Server Trip results contain:

Total time: The time between the server’s first read operation and the server’s last write operation for this trip.
Init time: The time spent actually writing the 1 initialization byte to the sockets.
Read time: The time spent actually reading the payload from the socket, not counting time between select() operations.
Write time: The time spent actually writing the payload to the socket, not counting time between select() operations.
Read Sel Time: The time between the first read operation on a key and the last read operation on a key. This DOES include time spent inside the select() method as well as time spent inside read().
Write Sel Time: The time between the first write operation on a key and the last write operation on a key. This DOES include time spent inside the select() method as well as time spent inside write().

The Client process is a group of connections and has the following trip summary:

Read Time: This is the sum of all clients’ time to finish reading in the payload. This is misleading, as there is the chance of time overlap involved here.
Write Time: This is the sum of all clients’ time to finish writing the payload to the socket. This is misleading, as there is the chance of time overlap involved here.
Round Trip Time (RTT): This is the sum of all clients’ time between their final write and their first read. This is essentially the time taken to go across the network, be processed by the server, then get back to the client. This is misleading, as there is the chance of time overlap involved here.
Total Time: This is the sum of all clients’ time between their first write and their final read. This is essentially the time taken to write all bytes, go across the network, be processed by the server, get back to the client, and have all bytes read in. This is misleading, as there is the chance of time overlap involved here.

The averages are then printed out. These give a better picture: they are the totals printed in the previous line (and explained above) divided by the number of clients.

The next line gives more insight:
Just Reading: The time spent actually reading the payload from the socket, not counting time between select() operations.
Just Writing: The time spent actually writing the payload to the socket, not counting time between select() operations.
Total Read Time: The time between the first read operation on a key and the last read operation on a key. This DOES include time spent inside the select() method as well as time spent inside read().
Total Write Time: The time between the first write operation on a key and the last write operation on a key. This DOES include time spent inside the select() method as well as time spent inside write().

REASON FOR THIS POST:
Now, after explaining all this, the purpose of my post is to get some insight into the numbers I am receiving from running this test. I am using the sample code below to do some stress testing on how many connections an advanced gaming server built on NIO can handle. The tests I have run have been rather disappointing, which leads me to think there is something going on that I can’t see.

For example: when running the test with 2000 clients, each sending 100 bytes, over 100 trips, I will see the server taking anywhere from 5 to 20 seconds to read in all the bytes and write them all back to the clients. It appears that a lot of time is spent inside the select() method waiting to be notified for the operation to take place. The actual reads and writes aren’t taking that much time. Because there is only one thread, that rules out MT deadlocks.

I have run this test on Windows, Linux (Red Hat), and Solaris, all with similar results. All on my company’s 100Mbps (I think?) network. I have had my Network Admin sniff the network to see if there is any lag, and he assures me there is nothing slow or fishy going on.

Any insight or improvements on the code are welcome. Or any other code to test how many connections NIO can handle. My objective is to see how many players can send up their scores to be processed and ranked within a 10-second window.

Thanks for any insight. Sorry for the long post.

The code will be the following post

[quote]Why did this thread die?

I essentially pared down my sample code so that there is only 1 thread with one selector for all 3 operations: OP_CONNECT, OP_READ, OP_WRITE. I know this is not the best approach, but I wanted to remove any chance of having MT issues. The following code was adapted from a Sun developer example and modified to keep track of how long each operation takes.
[/quote]
If I understand correctly, to summarise (and slightly over-simplify):

1: lots of clients get connected to the server, and go into a wait() situation.
2: once all are waiting, server does the equivalent of a notifyAll, to get ALL of them simultaneously to do their transfers.
3: …server waits for all to get back into the wait, then starts again

Questions:

  1. Are your clients and server separate machines, with fully switched connection (as opposed to hubs)?
  2. What happens if you double the number of physical client machines, halving the number of client-apps running on each?
  3. What’s the LAN saturation like?
  4. Any collision-storms going on? (sounds like you have an admin who would have net-management tools to give you this info).
  5. Have you tried having two selectors on the server, one for read, one for write? Changing the interestOps for a key is a trivial method call hiding a non-trivial implementation (note that it has at least one blocking point just to change the interestOps!)…it could be that there is some delay added by the switching back and forth of all those interest sets.
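On question 5, the two-selector idea can be sketched like this (my own illustration, using a Pipe instead of real sockets for self-containment): read interest and write interest live on separate selectors, so a key’s interest set never has to be toggled back and forth.

```java
import java.nio.ByteBuffer;
import java.nio.channels.Pipe;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;

// Sketch: one selector dedicated to reads, one to writes, so no
// interestOps() flipping (with its blocking caveats) is ever needed.
public class TwoSelectors {
    // Pushes one byte through a Pipe and returns {readReady, writeReady}.
    public static int[] demo() throws Exception {
        Selector readSel = Selector.open();
        Selector writeSel = Selector.open();

        Pipe pipe = Pipe.open();
        pipe.source().configureBlocking(false);
        pipe.sink().configureBlocking(false);

        // Each end registers with exactly one selector, once, forever.
        pipe.source().register(readSel, SelectionKey.OP_READ);
        pipe.sink().register(writeSel, SelectionKey.OP_WRITE);

        int writeReady = writeSel.select(2000);  // sink is writable at once
        pipe.sink().write(ByteBuffer.wrap(new byte[] { 42 }));
        int readReady = readSel.select(2000);    // now the source has data

        readSel.close();
        writeSel.close();
        pipe.source().close();
        pipe.sink().close();
        return new int[] { readReady, writeReady };
    }
}
```

In a real server the read selector would live on one thread and the write selector on another, which also splits the locking described earlier across two independent key sets.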

However, I suspect the problem is something completely different and hope that someone else will reply with a “nah, this is the problem:” post instead ;).

PS the thread isn’t dead, just sleeping ;)…personally, I’ve suddenly received a couple of new deadlines - and haven’t had time to come back to this thread :(.