Ok, verified the results again, so here’s the post I deleted:
Some initial experimentation has turned up some interesting things. I ran a simple 1D particle simulation (increment position based on velocity, apply a tiny bit of damping to the velocity) in both Java and C++, comparing the behavior depending on whether I had two separate arrays for velocity and position or one interleaved array. I can post the full code if anyone cares (and yup, I am very aware of the pitfalls of writing a Java microbenchmark!), but for now I won't clutter things up beyond a quick sketch of the two inner loops.
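Sketched from memory rather than copied verbatim - COUNT and the even/odd interleaving convention are just my shorthand here:

// 20000 particles, as in the tests below
final int COUNT = 20000;

// separate arrays: position and velocity each in their own float[]
float[] px = new float[COUNT];
float[] pv = new float[COUNT];
for (int i = 0; i < COUNT; ++i) {
    px[i] += 0.01f * pv[i];   // integrate position
    pv[i] *= 0.99f;           // damp velocity
}

// interleaved: one array, position at even indices, velocity at odd ones
float[] p = new float[COUNT * 2];
for (int i = 0; i < COUNT * 2; i += 2) {
    p[i]     += 0.01f * p[i + 1];
    p[i + 1] *= 0.99f;
}

Here are the results, in FPS, higher is better: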
(20000 particles over 10000 frames of simulation, 40 tests run and averaged)
C++ Results:
Normal floats: 5950.77
Interleaved floats: 30524.33
Java Results (Apple JVM 1.6, -server - I’ll check this on a Sun VM pretty soon, too):
Normal floats: 22105.749484383392
Interleaved floats: 22242.982339072023
So C++ with separate arrays performs the worst by a MASSIVE amount - going to the interleaved array is about a 5:1 improvement there. But in Java there's essentially no difference, at least at this particle count; trust me, the sub-percentage-point difference you see there is essentially just noise.
The really surprising thing (to me, at least) is that Apple's JVM actually does appear to be making good on the claim that it can achieve better cache locality than AOT-compiled languages, as the non-cache-friendly Java code performs nearly four times as well as the corresponding C++.
I wondered whether the performance difference between Java and C++ could be accounted for by the bounds check, so I inserted a basic bounds check into the C++ version (it just breaks out of the loop if the check fails, which is probably not quite as complex or expensive as Java's):
C++ (with bounds check):
Normal floats: 5956.94
Interleaved floats: 25678.14
No difference at all for the regular float version (two arrays), but brings the interleaved version within spitting distance of the Java one.
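For comparison, the check Java performs implicitly on every array access is conceptually something like the following (using the pv array from the sketch above; in practice the JIT can often hoist it out of a loop or eliminate it entirely):

if (i < 0 || i >= pv.length) {
    throw new ArrayIndexOutOfBoundsException(i);
}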
Somewhat relevant to Cas’ suggestion, I also tried using an array of Particle objects in Java (just holders for x and vx, in this case):
Java using Particle class: 17139.75
Using an array of objects is still better than the worst C++, but still a ways behind optimized Java - and we haven't even touched the issue of creating objects on the fly or garbage-collecting them, which are the real slowdowns I tend to see in my code.
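For reference, the Particle holder is about as minimal as it sounds - presumably nothing more than:

class Particle {
    float x;   // position
    float vx;  // velocity
}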
So there’s no real conclusion here, except that short of getting rid of those bounds checks, pure float arrays in Java are handled about as well as can be expected, and the JVM actually is doing quite a bit of work to keep the cache miss rate low.
This is now totally off topic, but I ran into one thing that I should warn you about - the initial version of the particle update code I wrote read something like this:
for (int i = 0; i < particleList.length; ++i) {
    particleList[i].x += 0.01f * particleList[i].vx;  // integrate position
    particleList[i].vx *= 0.99f;                      // damp velocity
}
…and I was surprised to find that my timing information was heavily dependent on the number of frames I was simulating. But FPS should not depend on the number of frames you measure, as long as it's large, and yet I was hitting a cutoff around 9000 frames where pushing any higher meant an 80% drop in frame rate! (Interestingly, the interleaved method weathered this drop far better than the others in Java.)
The cause: it turns out that numerical underflow will freaking MURDER your program's performance, in either C++ or Java. That vx *= 0.99f line was bringing all the velocities down into the subnormal (denormalized) range after about 9000 frames, and from that point on every operation on them is a major drag on performance. In this case I wasn't interested in working around the issue properly, so I just changed the decay factor to 0.999f (putting off the inevitable a little further), but be aware of this in your own code: if you have a large number of variables that are used a lot and could exponentially decay to zero, it might be worth checking them against a threshold.
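If you'd rather deal with it outright instead of just postponing it, flushing tiny values to zero yourself is cheap. A sketch - the 1e-30f threshold is my own pick; anything comfortably above the smallest normal float (about 1.2e-38) keeps you out of subnormal territory:

for (int i = 0; i < particleList.length; ++i) {
    particleList[i].x += 0.01f * particleList[i].vx;
    particleList[i].vx *= 0.99f;
    // flush to zero before the velocity decays into the (slow) subnormal range
    if (Math.abs(particleList[i].vx) < 1e-30f) {
        particleList[i].vx = 0f;
    }
}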
Things I’ve learned:
- The overall conclusion: the JVM does exactly what it promises to, at least with floating point arrays, optimizing them in almost every case to the point that they are similar in performance to well-used C++ arrays.
- I could not find a SINGLE way to affect the timing via a local (non-algorithmic) optimization - storing the loop limit in a local, changing to a while loop, replacing the loop with a try/catch around while(true), etc. Every one of these performed EXACTLY the same (see the sketch after this list).
- The final keyword makes no difference at all to performance anymore, anywhere, at any time. Only use it where it “should” be used, not because you think it will make things faster.
- Scope does not affect speed within a class (passing an array as a parameter vs. storing it as a member doesn’t make a difference)
- Visibility of array does not affect speed
- Using getters and setters to marshal access to private object members does not degrade performance
- A for (Particle p : particles) loop performed exactly the same
- Any optimizations that are happening for fast array access do not seem to happen at garbage collection time - performance is identical before a single GC happens
- Memory layout management is probably NOT responsible for the fast access - disrupting the access patterns does not seem to make any difference, so the JVM must be doing some intelligent prefetching
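To be concrete about that second bullet, these are the kinds of variants I mean - reusing the px/pv arrays from the earlier sketch, and all three timed identically for me:

// variant 1: hoist the loop limit into a local
final int n = px.length;
for (int i = 0; i < n; ++i) {
    px[i] += 0.01f * pv[i];
}

// variant 2: plain while loop
int i = 0;
while (i < px.length) {
    px[i] += 0.01f * pv[i];
    ++i;
}

// variant 3: no explicit test at all - let the bounds check end the loop
try {
    int j = 0;
    while (true) {
        px[j] += 0.01f * pv[j];
        ++j;
    }
} catch (ArrayIndexOutOfBoundsException e) {
    // fell off the end of the array - done
}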
@jezek2: on the Mac, the -server mode in 1.6 doesn’t do very much, I think it just changes the way the heap sizing operates, which is irrelevant for these tests. I tried in 1.5, though, and got the following:
Apple 1.5 without -server:
Normal FPS: 7530.917239361543
Interleaved FPS: 7536.807696165959
Object FPS: 6943.34652777474
Apple 1.5 with -server:
Normal FPS: 22029.56632155148
Interleaved FPS: 21603.01266973487
Object FPS: 18347.889937613505
Local FPS: 13744.841045723999
So at least on OS X, 1.5 should DEFINITELY be used with the -server flag. I included the “Local FPS” entry in that last run because it shows that Apple's 1.5 -server has a speed problem when the array is passed around as a local variable rather than read from a member; I'm not sure why. I omitted it from all the other tests because it never differed.
I know that most people aren’t using Apple’s VM, so I’m going to boot into Windows right now and see how things work there…I’ll post some more results in half an hour or so.
And btw, I've never had problems with stuttering or anything like that due to the server flag; I tend to find that compilation pretty much ceases after the first several seconds of gameplay, which you can keep an eye on with the -XX:+PrintCompilation flag. Have you had trouble with this in the past?
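(If you want to watch the compilations yourself, the invocation is just something like the line below - the class name is a placeholder, obviously:)

java -server -XX:+PrintCompilation MyGame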