JGF v3 - status

blahblahblahh · May 12, 2005, 9:12pm

Most of the webstart problems are fixed by trying “outside” JGF for reasons less obvious than you think. For instance, if you load a JNLP from your hard disk you completely bypass MSIE. For instance, if you load a JNLP that is 100% broken BUT it is working just enough for webstart to pick the HREF out of the jnlp tag then it will IGNORE your jnlp and use the one it sees online, etc etc. There are many many issues that make it very hard to test and debug webstart issues at the webstart client, including of course all the problems to do with the fact that the webstart client is running your app in a rather funky and bizarre manner.

Anyway, tonight I ran 3 hours of stress tests over my LAN against the server, and found some interesting issues.

every page that uses templates is 4 times slower than pages that do not, even though the latter use 4 or 5 NONCACHED MySQL queries each (read: makes them very slow). Velocity is terrifyingly slow as a caching template system, and we may have to dump it (it’s also full of horrendous bugs, as almost everyone has seen by now)
there is what appears to be a race condition somewhere in the system. It triggers entirely randomly, anywhere from 161 requests into the run up to 15,000 requests into the run. It causes the JVM to hang completely with no exceptions, no errors, and no CPU usage. Hence I suspect it’s a deadlock, although I have no idea whether it’s my code, grexengine code, or the base libraries I’m using. This could be very painful indeed to track down.
I believe I’ve worked around the main cause of the frequently recurring outofmemory error by implementing my own memory manager to manage bytebuffers. I am unhappy that Sun didn’t provide any memory management support for NIO at all and that I’m back to digging out old C textbooks to refresh how to do memory management effectively.

However, I noticed in passing that there are many places in user code that are creating implicit bytebuffers. Because Sun have no support for mem management and appear to have flaws if not outright bugs in their garbage-collection of these things this “feature” is potentially a disaster waiting to happen. With my memory-management layer I should be able to conclusively decide if any future OOME was because of my manual BB’s or the implicit BB’s (it catches OOME and dumps the full status of every known buffer so I can manually verify whether it really should have “run out”). If the latter, then java.nio.* is, in a word, buggered. I very much hope that whatever was causing the OOME turns out only to be some bug (JVM, libraries, or me abusing them by accident) that affects explicit BB’s.

I’ve patched the main server, and now I’m just waiting to see what happens.

blahblahblahh · May 12, 2005, 9:15pm

I’ve also made the JNLP creation code a lot more fascist: it will go through all the fields handed to it and strip HTML. Where it finds P tags it will automatically pull out the contents of the first paragraph and only use them. This makes the “nanoblaster” game currently on JGF finally work (because it strips out the A HREF that was crashing Sun’s crummy parser) - but, sadly, that game’s author wrote it so that it crashes if it fails to initialise sound (*), and so it wouldn’t run on the soundless machine I tried it on :(. But at least it starts :).

(*) a very common problem that IMHO needs to be a FAQ because of the subtlety of it. Very annoying to write games and not realise what happens to machines with e.g. no sound card.

blahblahblahh · May 12, 2005, 9:19pm

Finally, in case it wasn’t sufficiently obvious, the image in my SIG here on JGO is dynamically created by the JGF server. If everyone’s browsers would render SVG images properly, I’d turn this on for all users, but at the moment so many browsers can’t do it that I need to run a post-processor that converts them to JPG’s. This massively slows down the server! If I have nothing better to do with my time at some point, I’ll work some magic with faked expires times and possibly a small in-memory LRU cache for the JPG’s and enable the feature for all users.

The idea is to create a SIG image automatically that keeps updating with things like “how popular my games are” (for authors) and “where I stand on the highscores tables on JGF games” (for games players)

swpalmer · May 12, 2005, 10:36pm

            <description kind="short" ><p>
For more details check out <a href="http://games.rastaduck.org/NanoBlaster/">the NanoBlaster home page</a>
</p></description>
The mind boggles what kind of parsing they’re doing; it’s only trivial XML (didn’t include any ampersand directives etc). I wonder how you could incorrectly parse that in such a way if you’re using any of the built-in XML parsers (DOM and SAX)?

The thing that comes to mind is that the description element you have there doesn’t have a single text node at all. It only has an HTML paragraph element. I don’t know if that is related to the problem, but as kevglass indicated, you can’t just go and put HTML wherever you want inside an XML document. Unless it said in the JNLP spec that HTML is allowed… (I didn’t check, but have assumed that it wasn’t allowed)

nonnus29 · May 12, 2005, 11:09pm

[quote] Finally, in case it wasn’t sufficiently obvious, the image in my SIG here on JGO is dynamically created by the JGF server.
[/quote]
That’s really cool, almost cool enough to motivate me to finish a game to get it in my jgf jpg… :o

blahblahblahh · May 13, 2005, 6:19am

One of the laziest ways to parse XML (if using SAX) would simply pick up all text in all subnodes as a by-product. Shrug. I’m not saying it should be done like that, but in many cases in people’s programs it does work because they decided to just ignore any tags inside tag X and just pick up all the text that was there. This used to be more common than it is now, possibly because more people use DOM ::).

Sigh. I knew from the start it was not meant to be HTML, but I also knew that XML parsers would cope with it without crashing so long as it was well-formed XML. AFAIAA (maybe I’m missing somethhing here?) if you use an XML parser properly it shouldn’t crash in this kind of scenario; unexpected tags within a tag you are reading will not affect you (your code simply won’t access them). It’s one of the advantages of XML: it’s robustness in the face of not-quite-right input.

The spec doesn’t say what will happen, only that it may not be HTML (IIRC the word is “may” and not “must”). So I tried it to see if it would pick up the text (it didn’t, fair enough - that’s a perfectly valid way to deal with it) and then it went onto my todo list to strip the HTML tags, somewhere around the middle. The fact that you’re only supposed to have one paragraph meant I would also need to put in logic to pick up just one paragraph.

In almost all cases, it’s no problem, it just means it has no description. But in this one case it kills the webstart client, and I’m still at a loss as to why/how. And, as we’ve seen, the error message is completely wrong (typical with Sun’s webstart client: IIRC I’ve not yet seen an error message that was 100% correct apart from the “the jars aren’t signed” one).

So…I’m not trying to defend the use of HTML in that tag, but I object to the awful error handling in the webstart client, and I am surprised that it caused any problem at all, let alone a “fatal” one :(.

blahblahblahh · May 13, 2005, 6:26am

;D everything in there is generated algorithmically, apart from the cog image itself. It was a bitch to get circular text, but I got it in the end ;), and even though it looks a bit “first steps with WordArt” it was just a proof of concept.

The big question now is: What should go in there? Anything is possible. Suggestions are welcome, and I’ll probably put a small form up in your profile page where you can pick e.g. 3 stats in particular to have in your image.

bleb · May 13, 2005, 10:01am

[quote]Hence I suspect it’s a deadlock, although I have no idea whether it’s my code, grexengine code, or the base libraries I’m using. This could be very painful indeed to track down.
[/quote]
At least on linux, hitting ctrl-\ prints a thread dump and identifies deadlocks

kappa · May 13, 2005, 10:05am

what ever u did to JGF for some reason its way cooler now!

blahblahblahh · May 16, 2005, 5:00am

Thanks for the tip ;). Sadly, since this is a server, the app runs as a background service (using the start/stop daemon) and I’ve no idea how you could send particular characters direct to the input stream. I also strongly suspect all output would disappear - the standard logging libraries only log log messages, not stdout and stderr.

(I’m sure if I spend a while digging in manpages I could find a way to re-route stuff to/from the daemon, but it’s just not trivial)

blahblahblahh · May 16, 2005, 5:11am

[quote]what ever u did to JGF for some reason its way cooler now!
[/quote]
Thanks. While Chris was slaving over the forums here, I was slaving over webstart-extensions, bugs in Sun’s webstart client, and cleaning-up the game templates.

You can see here: http://javagamesfactory.org/views/view-game?name=Survivor a taste of what the other games will be like, with their libraries shown at the bottom of the screen (note: this is all handled automatically, of course ;)).

You may also notice that there’s a spot for each author to include a small photo of themself … but it’ll be a while before I have time to add the DB table etc to store them.

Unfortunately, a super huge bug in webstart means that anyone who accessed JGF in the last few days has a high chance of having picked up JNLP files that crash the JGF server.

As far as I know there is nothing I can do about this: it’s a horrid, horrid bug in the webstart client AND because of the other bug in the client (the one where it caches broken JNLP’s indefinitely) it is not possible to get a new, working, JNLP to any of these people.

I don’t know; maybe I will just have to trace their IP addresses, and then post them, and ask anyone who recognires themself to delete their webstart cache. Until they do, they are submitting illegal HTTP requests to the server and causing a DoS attack >:(.

blahblahblahh · May 16, 2005, 5:21am

Thanks. While Chris was slaving over the forums here, I was slaving over webstart-extensions, bugs in Sun’s webstart client, and cleaning-up the game templates.

You can see here: http://javagamesfactory.org/views/view-game?name=Survivor a taste of what the other games will be like, with their libraries shown at the bottom of the screen (note: this is all handled automatically, of course ;)).

You may also notice that there’s a spot for each author to include a small photo of themself … but it’ll be a while before I have time to add the DB table etc to store them.

Unfortunately, a super huge bug in webstart means that anyone who accessed JGF in the last few days has a high chance of having picked up JNLP files that crash the JGF server.

As far as I know there is nothing I can do about this: it’s a horrid, horrid bug in the webstart client AND because of the other bug in the client (the one where it caches broken JNLP’s indefinitely) it is not possible to get a new, working, JNLP to any of these people.

I don’t know; maybe I will just have to trace their IP addresses, and then post them, and ask anyone who recognires themself to delete their webstart cache. Until they do, they are submitting illegal HTTP requests to the server and causing a DoS attack >:(.

EDIT: I remembered that other grexengine servers witnessed deliberate attacks like this a couple of years ago. There is something I can do about this; the base grexengine server is immune to these corrupt HTTP attacks, so it’s obviously something in my forked parser that’s broken. Once I’ve fixed my parser, the DoS should go away (for some reason, the broken requests are damaging any genuine requests that come in on different sockets at the same time; I have no idea how this is even possible, since there’s no shared data - it’s all done with ThreadLocal objects etc). But the people with broken copies of JGF games will find that they never work again, AFAIAA

maxcaleb · May 16, 2005, 11:42pm

Not sure how possible this is with your current setup, but JDK 1.5 includes some new profiling/analysis tools including jstack, which will perform the same function as
ctrl-, except you can attach it to an arbitrary pid.

blahblahblahh · July 31, 2005, 12:18am

Update…

All work on JGF halted whilst trying to prove + demonstrate to Sun the recently discovered probable NIO bug where it drops characters. I haven’t heard back from them in many weeks, and I don’t have time to wait for a response that may never come; I’ve now given up on running JGF over NIO.
Given that JGF runs on a web-application that runs on the GrexEngine, and the GrexEngine is a Service-Oriented Architecture / platform for hosting any kind of server, it’s not all that dissimilar from a servlet running on J2EE. Today I created a handful of GrexEngine wrapper classes that enable me to embed the entire app within a J2EE-compatible container, and have the J2EE container only handle the HTTP requests ***
As of this evening, the basic website is working (generating pages dynamically and outputing them to web browser), so things are going well. It’s only taken a couple of hours, and I hope to have the whole thing working again within a matter of days, depending on my free time.
Hopefully…with NIO out of the way, and if the bug really is from NIO, and not my code, JGF should be “fixed” very soon. I’m hoping it will also turn out that NIO or my HTTP parser was the cause of some of the stability problems, so that it will be more stable from now on. Performance will probably suffer a little, but the container is very fast and I doubt anyone will notice the difference!

*** - to anyone who’s interested, none of the crap of WAR’s etc is pulled in to contaminate things: the system is still managed by the GrexEngine, and its dynamic config and clustering system etc, it’s just that for HTTP (where NIO is apparently corrupting my channels) the J2EE container is an additional module that can generate and complete incoming requests and their responses.

princec · July 31, 2005, 1:06pm

It’s amazing how you can take 12 months to code a simple dynamic website in Java which would take Chaz about 2 weeks in PHP with Mysql eh?

Cas

blahblahblahh · July 31, 2005, 1:37pm

Thanks for the vote of confidence, you sarky git :(.

The vast majority of the time spent so far has been things that would have taken just as long if done in PHP - e,g,:

[]working through the intracacies of the webstart protocol
[]GUI design and improvement (making the admin controls easy to use just as much as the visual layout)
[]working through the intracacies of HTTP and CSS (in)compatibility
[]security controls, given how much of the site is admined on-line directly
[]modules for: library comparison and maintenance, article suggestion and editing, game-submission and editing, etc
[]…etc

…and if it’s so easy, why hasn’t everyone else done all that? How many JNLP-creators have YOU seen, for instance? I’ve seen a couple of partial ones, but none that did everything I wanted (…and, assuming the site is working again in a week or so, I can go back to finishing my final webstart creator that uses the version protocol)

In comparison a relatively small amount of time has gone on things like me breaking the HTTP parser, and making high-performance file upload and download that isn’t going to crash the server when a dozen or more people try to upload 20 Mb of JARs at the same time. When it got to the point where I had a bug in NIO I couldn’t seem to workaround, I took the cop-out approach and plugged-in Jetty (although I had to wait > 1 month to give the Sun engineer time to look at the server).

BTW: the problem with the trivial LAMP approach is that it completely breaks when the site becomes popular. You then have to spend an extra 6 months (or, look at gamedev.net - they spent upwards of 18 months with a broken site at this point. Or look at flipcode - 18+ months again to fix) to fix the thing, meanwhile your once-popular site is getting slagged off for getting worse and worse every day.

princec · July 31, 2005, 2:35pm

[quote]How many JNLP-creators have YOU seen, for instance?
[/quote]
LOL! One, as it happens. I hired a guy a few weeks ago and his first day on the job he whipped on up for our Webstart deployed app in PHP inside 30 minutes. (Each customer needs a custom jnlp file).

Cas

blahblahblahh · July 31, 2005, 2:47pm

Even someone who’s never written java code before could probably “whip up” a JNLP creator that made some trivial JNLP’s with a tiny amount of customization. That’s not the same as supporting the entire gamut of the JNLP spec, with arbitrary input data at every point - it takes around 30 minutes just to actually read the entire spec. Shrug.

Anyway, everything now appears to be working bar file-uploads (which is simply because I haven’t ported them to J2EE old-I/O yet).

Orangy_Tang · July 31, 2005, 3:12pm

[quote=“blahblahblahh,post:178,topic:20256”]
So any idea of an expected launch date? I might have to get off my arse and try and get this little shooty game up to a playable state for it…

woogley · July 31, 2005, 6:00pm

in about 3 months time I did Java Unlimited… though it doesnt have a JNLP generator (nor does it need one), but the full admin panel / game submitting / commenting etc is all there. I orignally was going to do it with Java but PHP was less bulky and easier to use (no driver loading in the actual code just to do a quick DB call, for example). Prior to Java Unlimited I didn’t know hardly any PHP, so it was also my way of learning it.

I’m not saying throw away all of your JGF code, but you might want to weigh your coding options a bit. So far I havent had any I/O problems (or any other unexplainable bugs) with PHP.

IMO, Java is better suited for programs… not webpages. But hey, just an opinion. :-X