Java2D Benchmark: J2DBench

As Chris mentioned in his blog:
http://weblogs.java.net/blog/campbell/archive/2005/03/strcrazy_improv_1.html

And as Jim mentioned in this JavaLobby post:
http://www.javalobby.org/java/forums/m91828040

you now have access to our internal Java2D benchmark,
J2DBench, through the mustang.dev.java.net project.
(See Jim’s post for more details on how to get the benchmark.)

Please try it, and let us know your comments, results, etc.

Hopefully we’ll be able to create a separate project for J2DBench
in the future, so people can contribute new tests, improve
the UI, or whatever. Right now, releasing it as part of the mustang
source drop was the best we could do.

At this point, contact Chris or me if you want to get something
fixed.

One of the things we wanted to do but didn’t have time for: when
creating an options file using the UI, show the user an approximate
time the benchmark would take to complete. Otherwise it’s
very easy to create an options file which will take days to execute.
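
Roughly, such an estimate could come from multiplying each enabled test’s requested repetition count by a per-operation cost from a short calibration pass. The sketch below is purely illustrative - none of its names come from J2DBench itself:

```java
// Hypothetical sketch: estimate total run time for an options file by
// multiplying the configured repetition count of each enabled test by a
// rough per-operation throughput measured in a short calibration pass.
import java.util.Map;

public class RunTimeEstimator {
    /**
     * @param repsPerTest   test name -> number of operations requested
     * @param opsPerSecond  test name -> rough throughput from a calibration pass
     * @return estimated wall-clock seconds for the whole options file
     */
    public static double estimateSeconds(Map<String, Long> repsPerTest,
                                         Map<String, Double> opsPerSecond) {
        double total = 0.0;
        for (Map.Entry<String, Long> e : repsPerTest.entrySet()) {
            Double rate = opsPerSecond.get(e.getKey());
            if (rate != null && rate > 0) {
                total += e.getValue() / rate;
            }
        }
        return total;
    }
}
```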

Thanks,
Dmitri

Hi Dmitri,

Thanks for this. I have downloaded and used it a bit.

It is quite hard to get a grasp of how things are architected since there are no comments. I think just an hour’s work on a write-up of the general architecture would help a lot.

I’m busy at the moment (messing around with JavaBeans and custom PropertyEditors), but I shall make some time to check it out more thoroughly later.

Q: If someone were to rewrite much of it, especially the GUI and the general run order, but kept the actual types of tests, would that ever have any chance of making it back into the main branch? I understand if you have too much historical data that would be useless after such a rewrite, but sometimes a line has to be drawn?!

Have you considered variance-based benchmarks? I mean benchmarking with a result-quality threshold instead of running a fixed number of iterations. For instance, quit a sub-test when the results are within a 1% variance?

That would make the benchmarks MUCH faster to run, especially when there are many iterations and quite stable times, such as with a fillRect(Color).
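
Something along these lines is what I have in mind - a purely hypothetical sketch, not J2DBench code: keep timing batches until the last few batch times agree to within a tolerance.

```java
// Hypothetical sketch of a variance-based stopping rule: keep timing
// batches of operations until the last few per-op times agree to within
// a relative tolerance (e.g. 1%), instead of always running a fixed count.
import java.util.ArrayDeque;
import java.util.Deque;

public class ConvergenceTimer {
    private static final int WINDOW = 5;          // batches to compare
    private static final double TOLERANCE = 0.01; // 1% relative spread

    public static double timeUntilStable(Runnable op, int opsPerBatch, int maxBatches) {
        Deque<Double> window = new ArrayDeque<>();
        double last = Double.NaN;
        for (int b = 0; b < maxBatches; b++) {
            long start = System.nanoTime();
            for (int i = 0; i < opsPerBatch; i++) {
                op.run();
            }
            last = (System.nanoTime() - start) / (double) opsPerBatch;
            window.addLast(last);
            if (window.size() > WINDOW) {
                window.removeFirst();
            }
            if (window.size() == WINDOW && isStable(window)) {
                break;
            }
        }
        return last; // nanoseconds per operation from the final batch
    }

    private static boolean isStable(Deque<Double> times) {
        double min = Double.MAX_VALUE, max = 0;
        for (double t : times) {
            min = Math.min(min, t);
            max = Math.max(max, t);
        }
        return (max - min) / min <= TOLERANCE;
    }
}
```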

Also, to make it more widely used and contributed to, I think the visuals of the actual benchmarks must be spiced up a bit. Some randomness, perhaps?

I would also suggest adding an outer loop, or something similar, that iterates over different batch lengths. For instance:

  1. Totally random operations (contexts)
  2. 10 identical operations before switching context
  3. 100 identical operations before switching context
    …and so on. That way the new batching setups, especially for OpenGL I guess, can be better tested and compared to non-batched rendering (rough sketch below).

Just some random thoughts that I needed to get out of my head… :wink:
There are more though, but now it’s dinner time…

Cheers,
Mikael Grev

Hi Mikael,

Thanks for trying this out. I agree that it needs much better documentation - in our rush to get it out, Dmitri only had time to create a quick README on running it. I’ll work on a README for adding and modifying the tests.

To get you started, here is a quick rundown of the design (with a rough class sketch after the list):

  • Everything is a “Node”
  • There are Group nodes that let us group the tests. These also affect the GUI and the hierarchical naming of the nodes in the options and results files
  • Option objects are nodes that let you specify some sort of test option - there are many types such as Enable, Toggle, Integer, Lists of objects or int values, etc. Most option nodes are visited during the run of tests so that they can set a value or modify the environment that the test is run under.
  • Tests are nodes that perform the actual tests; they are also a subclass of the Enable type of option, so they can be turned on or off
  • Options files are hierarchical dumps of the tree where the Option nodes write out their (fully qualified hierarchical) names and values
  • Results files are HTML files that contain a dump of the Java properties, and for each test a dump of the options in effect for that test and a list of the raw timings (algorithm for averaging is chosen by the program that dumps the results)
  • I’d have to go back and do some investigation to document the way that the tree is traversed when writing out options or running the tests.
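
A very rough sketch of how those pieces relate - simplified and illustrative only; the names and shapes below are not the actual J2DBench classes:

```java
// Simplified sketch of the node hierarchy described above; class and method
// names here are illustrative, not the actual J2DBench source.
import java.util.ArrayList;
import java.util.List;

abstract class Node {
    final String name;
    final Group parent;
    Node(Group parent, String name) {
        this.parent = parent;
        this.name = name;
        if (parent != null) parent.add(this);
    }
    // Fully qualified hierarchical name used in options and results files.
    String qualifiedName() {
        return parent == null ? name : parent.qualifiedName() + "." + name;
    }
}

class Group extends Node {
    final List<Node> children = new ArrayList<>();
    Group(Group parent, String name) { super(parent, name); }
    void add(Node n) { children.add(n); }
}

abstract class Option extends Node {
    Option(Group parent, String name) { super(parent, name); }
    abstract String valueString();       // written to the options file
    abstract void modifyEnvironment();   // applied while the tests run
}

abstract class Test extends Option {     // a Test is an enable-style Option
    boolean enabled = true;
    Test(Group parent, String name) { super(parent, name); }
    String valueString() { return Boolean.toString(enabled); }
    void modifyEnvironment() { /* nothing to set up in this sketch */ }
    abstract void runTest(Object context, int numReps);
}
```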

Would we accept a major rewrite of the GUI? Absolutely - it is in sore need of a much better one. Rewriting the GUI shouldn’t invalidate any of the existing results files in any way, so this should be a non-contentious change to make and accept back.

Would we accept changes in the “look” of the tests? It’s not clear why that is necessary - the “look” is mainly there so that we can be sure the tests are doing the work we prescribed for them - “Yep, the colors are different every call”, “Yep, they are honoring the clip”, etc. Any change to the “look” of the tests might jeopardize the tight loops we run them in and water down the results - note that when I wrote the tests, even a boolean test in the inner loop could affect the results by 5-10%, which makes the benchmark less “pure”. The benchmark this one replaced tried to randomize things, and the result was that it got roughly half as much work done as this benchmark (the rest was overhead of the benchmark itself).
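
For illustration only (not the actual test code), compare a pure inner loop with one that re-checks an option every iteration; the second form is the kind of thing that waters the results down:

```java
// Illustrative only: the benchmark's inner loops aim to do nothing but the
// operation under test; even a cheap per-iteration branch (the second loop)
// adds measurable overhead on top of the operation being measured.
import java.awt.Graphics2D;

public class InnerLoopPurity {
    static void tightLoop(Graphics2D g, int numReps) {
        for (int i = 0; i < numReps; i++) {
            g.fillRect(10, 10, 50, 50);
        }
    }

    static void loopWithCheck(Graphics2D g, int numReps, boolean randomize) {
        for (int i = 0; i < numReps; i++) {
            if (randomize) {                     // extra boolean test every iteration
                g.fillRect(10 + (i & 7), 10, 50, 50);
            } else {
                g.fillRect(10, 10, 50, 50);
            }
        }
    }
}
```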

Using statistical analysis to determine when the testing is “done”? The current design is very simplistic in that respect and it would be nice to have something like this. I’d defer to someone with more experience in judging statistical variance for benchmark testing on this any day.

How disruptive would it be if some changes affected the results? Fairly disruptive, but not out of the question if there is good reason (the tests looking better is not a very good reason). We’d simply have to rerun the major baselines, though some intermediate baselines done on “weekly builds” may have to be tossed out - we are more concerned with the results from the major releases, all of which we have archived and can rerun. Determining what would be “good reason” would involve more discussion of specific cases.

Adding meta-tests? That would be nice. One of the things that really killed performance in 1.2 and 1.3 was changing the attributes on the fly. The “same color/different color” setting is enough to show that particular problem, but other attribute-changing interactions may exist, and we currently have little support for benchmarking those.
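
A hypothetical shape for such a meta-test - illustrative only, not existing J2DBench code - might pit an attribute change on every call against a constant attribute:

```java
// Hypothetical meta-test sketch: measure the cost of changing an attribute
// (here the paint color) between every primitive, versus leaving it alone,
// to expose attribute-switching overhead of the kind seen in 1.2/1.3.
import java.awt.Color;
import java.awt.Graphics2D;

public class AttributeChurnTest {
    static void sameColor(Graphics2D g, int numReps) {
        g.setColor(Color.BLUE);
        for (int i = 0; i < numReps; i++) {
            g.fillRect(10, 10, 50, 50);
        }
    }

    static void differentColor(Graphics2D g, int numReps) {
        Color[] palette = { Color.BLUE, Color.RED, Color.GREEN, Color.ORANGE };
        for (int i = 0; i < numReps; i++) {
            g.setColor(palette[i & 3]);          // attribute change every call
            g.fillRect(10, 10, 50, 50);
        }
    }
}
```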

Adding new tests? Great!

Some random responses to your random thoughts - hope they help!

…jim