NIO and wait share interrupt flag >:(

blahblahblahh · April 17, 2005, 9:43am

In a way, I can’t believe it’s taken me this long to realise the horror of this setup, but I’ve just been bitten by it for the first time:

NIO and Object.wait() share the same messaging primitive, so you can only use one or the other, not both…

Thinking about it now, I’m lost for words: I must be missing something obvious, because it’s insanely stupid otherwise.

To make NIO work properly, you need to interrupt your threads, since Sun (in their infinite wisdom) decided to allow NIO to block indefinitely if you chose to touch the selector (IIRC there’s no particular reason for this, other than that it made it easier for them to implement their own API).

But that means you cannot use wait/notify anywhere in your select’ing thread, because the JVM has no means to distinguish between “interrupt any blocked NIO” and “interrupt any blocked monitor”. And I’ve just tried using a select’ing thread that, deep down in other code many classes away, performs a wait() for a brief period.
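For anyone following along, here is a minimal single-threaded sketch of the shared flag being described. It assumes nothing beyond the standard `java.nio.channels.Selector` and `Object.wait(long)`; the class name `SharedInterruptSketch` and the `demo()` helper are made up for illustration. It sets the thread's interrupt flag once and shows that both a blocking `select()` and a later `wait()` react to it.

```java
import java.io.IOException;
import java.nio.channels.Selector;

// Hypothetical demo class: not part of any library or of the GrexEngine.
final class SharedInterruptSketch {

    /**
     * Returns a two-element array:
     *   [0] true if the interrupt flag was still set after select() returned
     *   [1] true if the following wait() threw InterruptedException
     */
    static boolean[] demo() throws IOException {
        boolean[] result = new boolean[2];
        Object lock = new Object();

        try (Selector selector = Selector.open()) {
            // Pretend some other code interrupted us to "kick" the selector.
            Thread.currentThread().interrupt();

            // No channels are registered, so only the pending interrupt
            // makes this return. select() leaves the flag set.
            selector.select();
            result[0] = Thread.currentThread().isInterrupted();

            synchronized (lock) {
                try {
                    // The same flag now aborts an unrelated wait().
                    lock.wait(1000);
                } catch (InterruptedException e) {
                    // wait() clears the flag when it throws.
                    result[1] = true;
                }
            }
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        boolean[] r = demo();
        System.out.println("flag set after select(): " + r[0]);
        System.out.println("wait() was interrupted:  " + r[1]);
    }
}
```

The point is only that there is one interrupt flag per thread: whichever blocking call sees it first reacts, regardless of which subsystem the interrupt was meant for.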

(yes, I know I shouldn’t, and I will convert it to not using wait eventually and being purely asynch, but there was no reason why wait shouldn’t work that I could see at the time and it would save me significant implementation time for now, at the cost of some performance at runtime).

I’m now sitting here wondering if I can think up a hack to work around this that doesn’t have some obscure synchronization bug in it :(.

…or if the GrexEngine selectors need to be rewritten not to ever ever interrupt (which I don’t think is possible) or to do extra checks before interrupting using some internal hack to make up for the fact that Sun shared the blocking flag between two totally unrelated systems :(.

Any ideas would be welcome at this point, especially if someone spots an obvious mistake I’m making.

blahblahblahh · April 17, 2005, 9:46am

PS: I can see one obvious way out: getting the GrexEngine core rewritten so that it uses select(long) everywhere is a solution but practically useless (and not something that would be worth doing to the GE I suspect) … in order to make services responsive, that long would have to be around 1 ms, or much less, so effectively you wouldn’t have a selector at all but a highly inefficient poller.

kevglass · April 17, 2005, 9:48am

To me, this sounds like misuse of interrupting, really. If you’re interrupting a thread, it’s because you know it’s waiting in a state that is now invalidated because of some other event.

If you’re not sure what state it’s in, how come you’re interrupting it?

Hmm… maybe I’ve just only ever had naive use for it in the past.

Kev

blahblahblahh · April 17, 2005, 10:01am

Ah! Found it :). Selector.wakeup (maybe…haven’t checked fully yet)
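A rough sketch of what `Selector.wakeup()` buys you, assuming a selector with no channels registered so that nothing else can make `select()` return. The class name `WakeupSketch` and the helper `wakeWithoutInterrupt()` are invented for the example. The selecting thread comes back from `select()` without its interrupt flag ever being touched, so any `wait()` it does later is unaffected.

```java
import java.io.IOException;
import java.nio.channels.Selector;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical demo class: not part of any library or of the GrexEngine.
final class WakeupSketch {

    /** Returns true if select() was woken and the thread's interrupt flag stayed clear. */
    static boolean wakeWithoutInterrupt() throws IOException, InterruptedException {
        final Selector selector = Selector.open();
        // Starts as true so that a failure inside the thread reads as "not clean".
        final AtomicBoolean interruptedAfterSelect = new AtomicBoolean(true);
        final CountDownLatch returned = new CountDownLatch(1);

        Thread selecting = new Thread(() -> {
            try {
                selector.select(); // blocks: nothing registered, no timeout
                interruptedAfterSelect.set(Thread.currentThread().isInterrupted());
            } catch (IOException e) {
                // leave interruptedAfterSelect as true to signal failure
            } finally {
                returned.countDown();
            }
        });
        selecting.start();

        // Safe even if the thread has not reached select() yet: per the
        // Selector documentation, the next select() then returns immediately.
        selector.wakeup();

        boolean woke = returned.await(5, TimeUnit.SECONDS);
        selector.close();
        return woke && !interruptedAfterSelect.get();
    }

    public static void main(String[] args) throws Exception {
        System.out.println("woken without interrupt: " + wakeWithoutInterrupt());
    }
}
```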

But … now I need to go and find why the GE base-service I’m using doesn’t use it, but instead uses interrupt ???. Perhaps…wakeup didn’t work on all platforms at some point, and this is a legacy workaround?

/me needs to get hold of the design notes for the GE internals and check…

blahblahblahh · April 17, 2005, 10:11am

/me has been very stupid

The GE I/O code all uses wakeup, as you would expect.

But… the base service I’m extending is a threaded class that accepts workloads asynchronously (nothing to do with NIO). It’s such a basic class that I had been extending without even bothering to look at it carefully :-[

It needs to idle whilst there’s no work to do, and the base class has a sleep( 10 seconds ), whilst the work-accepting method does an interrupt each time new work comes in.

This should, AFAICS, be a wait-on-incoming-queue and notify-when-queue-added-to. The fact that it’s interrupting instead of using wait/notify as you’d expect is what then caused pain later on, and caused this mess.
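For illustration, a minimal sketch of the wait-on-queue / notify-on-add shape described here. This is not the GE base class; `WorkQueueSketch`, `submit` and `take` are hypothetical names, and the real class may have reasons for its design that are not visible from this thread.

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Hypothetical stand-in for an idle-until-work base service.
final class WorkQueueSketch {
    private final Queue<Runnable> pending = new ArrayDeque<>();

    /** Called by producers: enqueue work and wake a waiting worker. */
    public void submit(Runnable work) {
        synchronized (pending) {
            pending.add(work);
            pending.notify(); // no interrupt(), so the worker's flag is untouched
        }
    }

    /** Called by the worker thread: blocks until there is something to do. */
    public Runnable take() throws InterruptedException {
        synchronized (pending) {
            // Loop guards against spurious wakeups and lost races.
            while (pending.isEmpty()) {
                pending.wait();
            }
            return pending.remove();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        WorkQueueSketch queue = new WorkQueueSketch();
        Thread worker = new Thread(() -> {
            try {
                queue.take().run();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // a real shutdown request
            }
        });
        worker.start();
        queue.submit(() -> System.out.println("work item ran"));
        worker.join();
    }
}
```

With this shape, `interrupt()` is left free to mean "stop what you are doing", and code deeper in the worker can use its own `wait()` without being woken by every incoming job. On Java 5 and later, `java.util.concurrent.LinkedBlockingQueue` gives the same behaviour without hand-written monitors.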

/me goes off to check if there’s a particular reason it isn’t wait/notifying…

Jeff · April 18, 2005, 3:56am

Yup.

That’s just BAD Java code.