Most of the webstart problems get fixed by testing “outside” JGF, for reasons less obvious than you might think. For instance, if you load a JNLP from your hard disk you completely bypass MSIE. Or: if you load a JNLP that is 100% broken BUT working just well enough for webstart to pick the HREF out of the jnlp tag, then webstart will IGNORE your jnlp and use the one it sees online instead, etc. etc. There are many, many issues that make webstart problems very hard to test and debug at the client, not least the fact that the webstart client runs your app in a rather funky and bizarre manner.
Anyway, tonight I ran 3 hours of stress tests over my LAN against the server, and found some interesting issues:
-
every page that uses templates is 4 times slower than every page that does not, even though the latter each run 4 or 5 NON-CACHED MySQL queries (read: queries that ought to make them very slow in their own right). Velocity is terrifyingly slow even as a caching template system, and we may have to dump it (it’s also full of horrendous bugs, as almost everyone has seen by now)
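For a sense of where that time goes: each templated page does roughly the merge below per request (the class name, template file and sample data are invented for illustration; the caching property name is the Velocity 1.x one). Even with the template itself cached, every request still pays for building a context and a full merge pass:

```java
import java.io.StringWriter;

import org.apache.velocity.Template;
import org.apache.velocity.VelocityContext;
import org.apache.velocity.app.VelocityEngine;

public class TemplatedPage
{
    public static void main( String[] args ) throws Exception
    {
        // One engine for the whole server, with template caching switched on
        VelocityEngine engine = new VelocityEngine();
        engine.setProperty( "file.resource.loader.cache", "true" );
        engine.init();

        // The per-request work: build a context, then merge the cached template
        VelocityContext context = new VelocityContext();
        context.put( "title", "Stress test page" );
        context.put( "rows", new String[] { "a", "b", "c" } );

        Template template = engine.getTemplate( "page.vm" ); // hypothetical template
        StringWriter out = new StringWriter();
        template.merge( context, out );

        System.out.println( out );
    }
}
```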
-
there is what appears to be a race condition somewhere in the system. It triggers entirely randomly, anywhere from 161 to 15,000 requests into the run. It causes the JVM to hang completely with no exceptions, no errors, and no CPU usage; hence I suspect it’s a deadlock, although I have no idea whether it’s my code, grexengine code, or the base libraries I’m using. This could be very painful indeed to track down.
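If it really is a monitor deadlock, one cheap way to catch it in the act (short of taking a thread dump with kill -QUIT / Ctrl-Break when the hang happens) is a watchdog thread that asks the JVM itself via ThreadMXBean. This is only a sketch, assumes Java 5+, and obviously won’t spot hangs that aren’t genuine lock cycles:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class DeadlockWatchdog implements Runnable
{
    public void run()
    {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        while( true )
        {
            // Returns the IDs of threads stuck in a monitor deadlock, or null if none
            long[] ids = threads.findMonitorDeadlockedThreads();
            if( ids != null )
            {
                ThreadInfo[] infos = threads.getThreadInfo( ids, Integer.MAX_VALUE );
                for( ThreadInfo info : infos )
                {
                    if( info == null )
                        continue; // thread died in the meantime
                    System.err.println( "DEADLOCKED: " + info.getThreadName()
                        + " waiting on " + info.getLockName()
                        + " held by " + info.getLockOwnerName() );
                }
            }
            try { Thread.sleep( 10000 ); } catch( InterruptedException e ) { return; }
        }
    }

    public static void main( String[] args )
    {
        Thread t = new Thread( new DeadlockWatchdog(), "deadlock-watchdog" );
        t.setDaemon( true ); // in a real server, the other non-daemon threads keep the VM alive
        t.start();
        // ... server startup would go here
    }
}
```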
-
I believe I’ve worked around the main cause of the frequently recurring OutOfMemoryError by implementing my own memory manager for ByteBuffers. I am unhappy that Sun didn’t provide any memory management support for NIO at all, and that I’m back to digging out old C textbooks to remind myself how to do memory management effectively.
However, I noticed in passing that there are many places in user code that create implicit ByteBuffers. Because Sun provide no support for memory management, and appear to have flaws if not outright bugs in their garbage collection of these things, this “feature” is potentially a disaster waiting to happen. With my memory-management layer I should be able to decide conclusively whether any future OOME was caused by my manual BBs or by the implicit BBs (it catches the OOME and dumps the full status of every known buffer, so I can manually verify whether it really should have “run out”). If it turns out to be the implicit ones, then java.nio.* is, in a word, buggered. I very much hope that whatever was causing the OOME turns out to be nothing more than some bug (in the JVM, in the libraries, or in me abusing them by accident) that only affects explicit BBs.
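For the curious, the memory manager is conceptually nothing more than the sketch below (the real one tracks considerably more per-buffer state for the OOME dump; the class name and sizes here are invented): claim the direct memory once up front, hand out fixed-size slices, and never depend on the GC to reclaim direct buffers again.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

/**
 * Minimal sketch of a pooled allocator for direct ByteBuffers.
 * All direct memory is claimed once at startup; callers borrow
 * fixed-size slices and hand them back, so the JVM's collection
 * of direct buffers is never relied on after construction.
 */
public class ByteBufferPool
{
    private final int sliceSize;
    private final List free = new ArrayList(); // slices available right now
    private final List all  = new ArrayList(); // every slice ever created, for the OOME dump

    public ByteBufferPool( int sliceSize, int sliceCount )
    {
        this.sliceSize = sliceSize;
        try
        {
            ByteBuffer backing = ByteBuffer.allocateDirect( sliceSize * sliceCount );
            for( int i = 0; i < sliceCount; i++ )
            {
                // Carve the backing buffer into fixed-size windows and slice each one
                backing.limit( (i + 1) * sliceSize );
                backing.position( i * sliceSize );
                ByteBuffer slice = backing.slice();
                free.add( slice );
                all.add( slice );
            }
        }
        catch( OutOfMemoryError oome )
        {
            dumpStatus();
            throw oome;
        }
    }

    public synchronized ByteBuffer acquire()
    {
        if( free.isEmpty() )
            throw new IllegalStateException( "pool exhausted: " + all.size() + " slices in use" );
        ByteBuffer b = (ByteBuffer) free.remove( free.size() - 1 );
        b.clear();
        return b;
    }

    public synchronized void release( ByteBuffer b )
    {
        free.add( b );
    }

    /** On OOME, print what we actually hold so "ran out" can be verified by hand. */
    public synchronized void dumpStatus()
    {
        System.err.println( "ByteBufferPool: " + all.size() + " slices of "
            + sliceSize + " bytes, " + free.size() + " free" );
    }
}
```

Slicing one big direct buffer means all the native memory lives in a single allocation that never needs collecting, which is exactly the guarantee the implicit, per-call buffers can’t give you.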
I’ve patched the main server, and now I’m just waiting to see what happens.