In a way, I can’t believe it’s taken me this long to realise the horror of this setup, but I’ve just been bitten by it for the first time:
NIO and Object.wait() share the same signalling primitive (thread interruption), so on any one thread you can only rely on one or the other, not both…
Thinking about it now, I’m lost for words: I must be missing something obvious, because it’s insanely stupid otherwise.
To make NIO work properly, you need to interrupt your threads, since Sun (in their infinite wisdom) decided to allow NIO to block indefinitely if you touch the selector while a select is in progress (IIRC there’s no particular reason for this, other than that it made it easier for them to implement their own API).
But that means you cannot use wait/notify anywhere in your selecting thread, because the JVM has no means to distinguish between “interrupt any blocked NIO” and “interrupt any blocked monitor”: an interrupt aimed at the selector will just as happily abort a wait(). And I’ve just tried using a selecting thread that, deep down in other code many classes away, performs a wait() for a brief period.
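To make the overlap concrete, here is a minimal sketch, not taken from GrexEngine: the class and method names are made up for illustration. It parks one thread in Object.wait() and another in Selector.select(), then hits each with the very same Thread.interrupt() call. The sleep is only a crude way of giving each thread time to block.

```java
import java.io.IOException;
import java.nio.channels.Selector;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical demo class; none of these names come from GrexEngine.
class InterruptDemo {

    /** Interrupts a thread parked in Object.wait() and returns what that thread saw. */
    static String interruptWaiter() {
        final Object lock = new Object();
        final AtomicReference<String> outcome = new AtomicReference<>("not run");
        Thread t = new Thread(() -> {
            synchronized (lock) {
                try {
                    lock.wait(); // nobody ever calls notify() on this lock
                    outcome.set("woke without interrupt");
                } catch (InterruptedException e) {
                    outcome.set("InterruptedException");
                }
            }
        });
        t.start();
        interruptAndJoin(t);
        return outcome.get();
    }

    /** Interrupts a thread parked in Selector.select() and returns what that thread saw. */
    static String interruptSelector() {
        final AtomicReference<String> outcome = new AtomicReference<>("not run");
        Thread t = new Thread(() -> {
            try (Selector selector = Selector.open()) {
                int ready = selector.select(); // no channels registered, so this blocks
                outcome.set("select returned " + ready
                        + ", interrupted=" + Thread.currentThread().isInterrupted());
            } catch (IOException e) {
                outcome.set("IOException: " + e);
            }
        });
        t.start();
        interruptAndJoin(t);
        return outcome.get();
    }

    private static void interruptAndJoin(Thread t) {
        try {
            Thread.sleep(200); // crude: give the thread time to block
            t.interrupt();     // the same call in both cases
            t.join();
        } catch (InterruptedException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("wait():   " + interruptWaiter());
        System.out.println("select(): " + interruptSelector());
    }
}
```

The waiter gets an InterruptedException, while the selector simply returns with the thread’s interrupt status set. Nothing in the interrupt itself says which of the two it was meant for, so a wait() that happens to be running on the selecting thread will absorb an interrupt intended for the selector.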
(Yes, I know I shouldn’t, and I will eventually convert it to be purely async and not use wait at all. But at the time I could see no reason why wait shouldn’t work, and it would save me significant implementation time for now, at the cost of some runtime performance.)
I’m now sitting here wondering if I can think up a hack to work around this that doesn’t have some obscure synchronization bug in it :(.
…or whether the GrexEngine selectors need to be rewritten never to interrupt at all (which I don’t think is possible), or to do extra checks before interrupting, using some internal hack to make up for the fact that Sun shared the blocking flag between two totally unrelated systems :(.
Any ideas would be welcome at this point, especially if someone spots an obvious mistake I’m making.
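One direction that may be worth checking is Selector.wakeup(), which makes a blocked select() return without touching the thread’s interrupt status. The sketch below assumes that the only reason for interrupting is to unblock select() so that other work (such as registering a channel) can happen; whether that holds for GrexEngine is a guess. WakeupLoop, submit and demo are hypothetical names, and the queue-plus-wakeup shape is just one common pattern, not a drop-in fix.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.channels.Selector;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical selecting loop; not GrexEngine's actual design.
class WakeupLoop implements Runnable {
    private final Selector selector;
    private final Queue<Runnable> pending = new ConcurrentLinkedQueue<>();
    private volatile boolean running = true;

    WakeupLoop() {
        try {
            selector = Selector.open();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Any thread: queue work for the selecting thread, then unblock select(). */
    void submit(Runnable task) {
        pending.add(task);
        selector.wakeup(); // does not set the thread's interrupt status
    }

    void shutdown() {
        running = false;
        selector.wakeup();
    }

    @Override
    public void run() {
        try {
            while (running) {
                selector.select();
                Runnable task;
                while ((task = pending.poll()) != null) {
                    task.run(); // e.g. channel registrations, done on this thread
                }
                selector.selectedKeys().clear(); // real code would handle ready keys here
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } finally {
            try { selector.close(); } catch (IOException ignored) { }
        }
    }

    /** Runs a task that calls wait() on the selecting thread; true if it finished uninterrupted. */
    static boolean demo() {
        WakeupLoop loop = new WakeupLoop();
        Thread t = new Thread(loop, "selector");
        t.start();
        Object lock = new Object();
        CountDownLatch waitedCleanly = new CountDownLatch(1);
        loop.submit(() -> {
            synchronized (lock) {
                try {
                    lock.wait(200); // a brief wait "deep down in other code"
                    waitedCleanly.countDown();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        });
        loop.submit(() -> { }); // another wakeup, possibly while the first task is inside wait()
        try {
            boolean ok = waitedCleanly.await(5, TimeUnit.SECONDS);
            loop.shutdown();
            t.join(5000);
            return ok && !t.isAlive();
        } catch (InterruptedException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Calling WakeupLoop.demo() should return true: the extra wakeup only affects select(), so the wait() on the selecting thread times out normally instead of throwing. If anything else in the engine still calls interrupt() on that thread, the original problem remains, so this only helps if every interrupt can be replaced this way.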