Hi Mikael,
Thanks for trying this out. I agree that it needs much better documentation - in our rush to get it out, Dmitri only had time to create a quick README on running it. I’ll work on a README for adding and modifying the tests.
To get you started, here is a quick rundown of the design:
- Everything is a “Node” (there’s a rough class sketch after this list)
- There are Group nodes that let us group the tests. These also affect the GUI and the hierarchical naming of the nodes in the options and results files
- Option objects are nodes that let you specify some sort of test option - there are many types such as Enable, Toggle, Integer, Lists of objects or int values, etc. Most option nodes are visited during the run of tests so that they can set a value or modify the environment that the test is run under.
- Tests are nodes that perform the actual benchmark work; they are also a subclass of the Enable type of option so they can be turned on or off
- Options files are hierarchical dumps of the tree where the Option nodes write out their (fully qualified hierarchical) names and values
- Results files are HTML files that contain a dump of the Java properties, and for each test a dump of the options in effect for that test and a list of the raw timings (algorithm for averaging is chosen by the program that dumps the results)
- I’d have to go back and do some investigation to document the way that the tree is traversed when writing out options or running the tests.
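To make that a bit more concrete, here is a rough sketch of how the pieces could fit together. The class and method names below are just shorthand for this mail - they don’t necessarily match the names in the actual source - but the shape of the hierarchy is what I described above:

    import java.io.PrintWriter;
    import java.util.ArrayList;
    import java.util.List;

    // Everything is a node; a node knows its parent group and its short name
    // and can compute its fully qualified hierarchical name from them.
    abstract class Node {
        protected final Group parent;
        protected final String nodeName;

        Node(Group parent, String nodeName) {
            this.parent = parent;
            this.nodeName = nodeName;
        }

        // e.g. "tests.graphics.fillrect" (illustrative) - this is the name
        // used in the options and results files.
        String qualifiedName() {
            return (parent == null) ? nodeName
                                    : parent.qualifiedName() + "." + nodeName;
        }

        // Called while the tree is traversed to write an options file.
        abstract void writeOptions(PrintWriter out);
    }

    // Group nodes just collect children; they contribute to the hierarchical
    // naming but hold no value of their own.
    class Group extends Node {
        private final List<Node> children = new ArrayList<>();

        Group(Group parent, String nodeName) { super(parent, nodeName); }

        void add(Node child) { children.add(child); }

        @Override
        void writeOptions(PrintWriter out) {
            for (Node child : children) {
                child.writeOptions(out);
            }
        }
    }

    // Option nodes carry a value; an options file is just a traversal of the
    // tree with each option writing a "fully.qualified.name=value" line.
    abstract class Option extends Node {
        Option(Group parent, String nodeName) { super(parent, nodeName); }

        abstract String valueString();

        @Override
        void writeOptions(PrintWriter out) {
            out.println(qualifiedName() + "=" + valueString());
        }
    }

    // The Enable flavor of option is a simple on/off switch.
    class EnableOption extends Option {
        protected boolean enabled = true;

        EnableOption(Group parent, String nodeName) { super(parent, nodeName); }

        @Override
        String valueString() { return enabled ? "enabled" : "disabled"; }
    }

    // Tests are Enable options themselves, which is how individual tests get
    // turned on or off.
    abstract class TestNode extends EnableOption {
        TestNode(Group parent, String nodeName) { super(parent, nodeName); }

        abstract void runTest(Object context, int numReps);
    }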
Would we accept a major rewrite of the GUI? Absolutely - it is in sore need of one. Rewriting the GUI shouldn’t invalidate any of the existing results files in any way, so this should be a non-contentious change to make and accept back.
Would we accept changes in the “look” of the tests? It’s not clear why that would be necessary - the “look” is mainly there so that we can be sure the tests are doing the work we prescribed for them: “Yep, the colors are different every call”, “Yep, they are honoring the clip”, etc. Any change to the “look” of the tests might jeopardize the tight loops that we run them in and water down the results - when I wrote the tests I saw that even a boolean test in the inner loop could affect the results by 5-10%, which makes the benchmark less “pure” (there’s a small illustration of this below). The benchmark this one replaced tried to randomize its operations, and the result was that it got roughly half as much work done as this benchmark - all of the difference was overhead of the benchmark itself.
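To illustrate the kind of contamination I mean (this is made-up code, not anything from the benchmark): the first loop below folds the cost of the per-rep boolean check into the timing, while the second hoists the decision out so the timed loop contains only the work being measured.

    import java.awt.Color;
    import java.awt.Graphics2D;

    class LoopPurity {
        // Cheap way to get a different color each call (illustrative).
        static Color nextColor(int i) {
            return new Color(i * 0x9E3779B1);
        }

        // The branch is evaluated on every rep, so its cost is folded into
        // the timing of the drawing call itself.
        static long contaminated(Graphics2D g, boolean differentColors, int reps) {
            long start = System.nanoTime();
            for (int i = 0; i < reps; i++) {
                if (differentColors) {
                    g.setColor(nextColor(i));
                }
                g.fillRect(0, 0, 20, 20);
            }
            return System.nanoTime() - start;
        }

        // The variant is chosen once, outside the timed loop, so each loop
        // body contains only the operations being measured.
        static long pure(Graphics2D g, boolean differentColors, int reps) {
            long start;
            if (differentColors) {
                start = System.nanoTime();
                for (int i = 0; i < reps; i++) {
                    g.setColor(nextColor(i));
                    g.fillRect(0, 0, 20, 20);
                }
            } else {
                start = System.nanoTime();
                for (int i = 0; i < reps; i++) {
                    g.fillRect(0, 0, 20, 20);
                }
            }
            return System.nanoTime() - start;
        }
    }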
Using statistical analysis to determine when the testing is “done”? The current design is very simplistic in that respect, and it would be nice to have something like this - one possible approach is sketched below. I’d defer to someone with more experience in judging statistical variance for benchmark testing on this any day.
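For what it’s worth, one possible approach (just a sketch of a standard technique, not how the harness works today) is to keep timing fixed-size batches until the relative standard error of the mean drops below some threshold, or a batch budget runs out:

    import java.util.ArrayList;
    import java.util.List;

    class AdaptiveRunner {
        interface TimedTest {
            long runBatch(int reps);   // returns elapsed nanoseconds
        }

        // Collect batch timings until the mean looks stable enough.
        static List<Long> runUntilStable(TimedTest test, int batchReps,
                                         double maxRelStdErr, int maxBatches) {
            List<Long> samples = new ArrayList<>();
            for (int b = 0; b < maxBatches; b++) {
                samples.add(test.runBatch(batchReps));
                if (samples.size() >= 5 && relStdErr(samples) < maxRelStdErr) {
                    break;   // mean is stable enough; stop early
                }
            }
            return samples;
        }

        // Relative standard error of the mean: (stddev / sqrt(n)) / mean.
        private static double relStdErr(List<Long> samples) {
            int n = samples.size();
            double mean = 0;
            for (long s : samples) mean += s;
            mean /= n;
            double var = 0;
            for (long s : samples) var += (s - mean) * (s - mean);
            var /= (n - 1);
            return Math.sqrt(var / n) / mean;
        }
    }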
How disruptive would it be if some changes affected the results? Fairly disruptive, but not out of the question if there is good reason (the tests looking better is not a very good reason). We’d simply have to rerun the major baselines, though some intermediate baselines done on “weekly builds” may have to be tossed out - we are more concerned with the results from the major releases, all of which we have archived and can rerun. Determining what would be “good reason” would involve more discussion of specific cases.
Adding meta-tests? That would be nice. One of the things that really killed performance in 1.2 and 1.3 was changing the attributes on the fly. The “same color/different color” setting is enough to show that particular problem, but other attribute-changing interactions may exist and we currently have little support for benchmarking those (a rough sketch of such a test follows).
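Just to make the idea concrete, a meta-test along those lines might alternate attributes between primitive calls inside the timed loop - again, this is only a hypothetical sketch, not an existing test:

    import java.awt.BasicStroke;
    import java.awt.Color;
    import java.awt.Graphics2D;

    class AttributeChurnSketch {
        // Force attribute revalidation on every rep by flipping the color.
        static long timeColorChurn(Graphics2D g, int reps) {
            Color a = Color.RED, b = Color.BLUE;
            long start = System.nanoTime();
            for (int i = 0; i < reps; i++) {
                g.setColor((i & 1) == 0 ? a : b);
                g.fillRect(0, 0, 20, 20);
            }
            return System.nanoTime() - start;
        }

        // Same idea for stroke changes interacting with line drawing.
        static long timeStrokeChurn(Graphics2D g, int reps) {
            BasicStroke thin = new BasicStroke(1f), wide = new BasicStroke(5f);
            long start = System.nanoTime();
            for (int i = 0; i < reps; i++) {
                g.setStroke((i & 1) == 0 ? thin : wide);
                g.drawLine(0, 0, 20, 20);
            }
            return System.nanoTime() - start;
        }
    }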
Adding new tests? Great!
Some random responses to your random thoughts - hope they help!
…jim