Programming to the cache

To be honest, I wouldn’t be surprised if the decrease of CPU usage you were seeing had a different cause.

On a phone or something I could believe it could make a tiny difference, but on a modern PC with a decent JRE… :-
In any case, in a microbenchmark both ways are exactly as fast.

I do write things like <<1 sometimes because it looks faster, so it kind of hints to the reader that this piece of code is very performance critical.
Otherwise, I’ve never seen an actual difference on modern PCs.

Well, I think (according to the JIT source code) POT multiplications (by constants) are indeed replaced by bit-shifting, but there is this other problem: the ‘sweet spot’ of the JIT that one has to touch / potentially touches / probably misses.

I’ve seen cases like this:

Code Listing A


arr[p] = a[p] * b[p] + c[p];
arr[p+1] = a[p+1] * b[p+1] + c[p+1];
arr[p+2] = a[p+2] * b[p+2] + c[p+2];
arr[p+3] = a[p+3] * b[p+3] + c[p+3];

Code Listing B


arr[p+0] = a[p+0] * b[p+0] + c[p+0];
arr[p+1] = a[p+1] * b[p+1] + c[p+1];
arr[p+2] = a[p+2] * b[p+2] + c[p+2];
arr[p+3] = a[p+3] * b[p+3] + c[p+3];

As you can see, B should be slightly slower, or equally fast if the JIT removes the redundant addition of zero.

Yet in…
_that_very_specific_case_with_those_very_specific_other_in_a_certain_order_compiled_classes_
the addition of +0 in the source code DOUBLED performance - in a real app, not a silly microbenchmark.

The JIT just took a different optimisation path…!

And please don’t try to tell me I’m wrong because the change has no impact in your benchmark. I bet it has no impact, simply because it doesn’t meet the same conditions as:
_that_very_specific_case_with_those_very_specific_other_in_a_certain_order_compiled_classes_

You’ve given this example before as proof of your ‘bump the JVM in a random way and strange performance consequences may happen’ theory. The thing is, if changes like that can change performance that much, it might as well have been the other way around, right? :slight_smile:

When you gave this example earlier, I tested it in JEmu2 (where there are a lot of things like that in unrolled loops and such), but I guess I never hit that very specific case… Believe me, in JEmu2 I’ve tried MANY strange things to improve performance (JEmu2 is very sensitive to micro optimizations), and rarely did I encounter such things; usually the optimizations that did make a difference made logical sense.

I’m not saying you’re wrong, because I have seen strange performance jumps/drops myself, but in my experience they don’t happen often and you can never optimize in anticipation of them. In the end it’s still a matter of profiling -> optimizing -> profiling, etc.
But you’re right that it might be good to keep these things in the back of your head when you’re optimizing your code. It’s also good to keep in mind that you might be optimizing for a specific JVM on a specific platform when things like ‘+0’ make such a difference.

If you take this back to the *2 versus <<1 thing: if what zammbi was seeing was caused by hitting some strange JIT sweet spot, then my point remains that it was not necessarily caused by <<1 being faster than *2, and this sweet spot might just as well have been hit by writing *2 instead of <<1.

Oh, it was <<10, my mistake. It was also a method that was called a lot per second; I haven’t checked if this is still the case though.

Here’s a tidbit… the P4 no longer has a barrel shifter so anything more than << 1 takes more cycles these days, as I understand it.

Cas :slight_smile:

[quote]I know I’m late coming to this thread, but: Wha…? Since when? Proof?
[/quote]
Hum, why??? This doesn’t need proof, it has always been the case; it is true in C, C++ and certainly in all languages…

On a PC, a shift by one is a single machine op; most of the time it takes one CPU cycle (if the register is already loaded).
Multiplication requires operand 2 to be loaded, and the machine op for multiplication usually takes more than one CPU cycle.

On a PC the shift will become something like this:
MOV EAX, value
SHL EAX, 1      ; one CPU cycle

and the multiplication something like this:
MOV EAX, value
MOV EBX, 2
MUL EBX         ; 2-3 CPU cycles

but:

[quote](according to the JIT source code) POT multiplications (by constants) are indeed replaced by bit-shifting
[/quote]
Also, from what I understand SHL doesn’t execute in 1 cycle but rather in 6, due to the removal of the barrel shifter Cas talked about (it used to execute in 1 cycle on the 486).

Anyway, in asm probably the fastest way to multiply by 2 is a simple addition (well, it used to be the fastest way in the old days, but it has been a long time since I last did asm).

AMD seems to still have them. Or at least the one I have at home.

[quote]Anyway, in asm probably the fastest way to multiply by 2 is a simple addition (well, it used to be the fastest way in the old days, but it has been a long time since I last did asm).
[/quote]
So, three methods; anybody motivated to benchmark them?

n<<=1
n*=2
n+=n

Microbenchmarks are nearly impossible in Java.
In real-life cases, performance is already impossible to predict.
Further, it’ll depend on CPU, OS, and JVM.

Let’s not waste time on it.

Well, “<<1 is faster than *2” was your idea to begin with :wink:
You’re right of course that it’s almost impossible to measure this accurately and that the results might very well be totally different in real case scenarios, but you’re forgetting that benchmarks are fun :slight_smile:

On my machine at work (Dual core P4 @3.2GHz, 1GB, Java 1.6.0_03-b05):
-client
2515ms : val *= 2;
2515ms : val <<= 1;
3765ms : val += val;

-server
1969ms : val *= 2;
1968ms : val <<= 1;
1891ms : val += val;

For what it’s worth (if anything):


public class Test {

    public static void main(String[] args) {
        int vResult = 0;
        long ms1 = Long.MAX_VALUE;
        long ms2 = Long.MAX_VALUE;
        long ms3 = Long.MAX_VALUE;

        for (int i = 0; i < 10; i++) {
            long start = System.currentTimeMillis();
            vResult += test1();
            long ms = System.currentTimeMillis() - start;
            if (ms < ms1)
                ms1 = ms;
            System.out.println(ms + "ms : val *= 2;");

            start = System.currentTimeMillis();
            vResult += test2();
            ms = System.currentTimeMillis() - start;
            if (ms < ms2)
                ms2 = ms;
            System.out.println(ms + "ms : val <<= 1;");

            start = System.currentTimeMillis();
            vResult += test3();
            ms = System.currentTimeMillis() - start;
            if (ms < ms3)
                ms3 = ms;
            System.out.println(ms + "ms : val += val;");
            System.out.println(vResult);
        }
        System.out.println("--------------------------------------------------------");
        System.out.println(ms1 + "ms : val *= 2;");
        System.out.println(ms2 + "ms : val <<= 1;");
        System.out.println(ms3 + "ms : val += val;");

    }

    /**
     * Repeatedly multiplies val by 2; the OR with (ii & 1) keeps val from collapsing to zero.
     */
    private static int test1() {
        int val = 1;
        for (int ii = 0; ii < 2000000000; ii++) {
            val *= 2;
            val |= (ii & 1);
        }
        return val;
    }

    /**
     * Same loop as test1, but shifting left by 1 instead of multiplying.
     */
    private static int test2() {
        int val = 1;
        for (int ii = 0; ii < 2000000000; ii++) {
            val <<= 1;
            val |= (ii & 1);
        }
        return val;
    }

    /**
     * Same loop as test1, but adding val to itself instead of multiplying.
     */
    private static int test3() {
        int val = 1;
        for (int ii = 0; ii < 2000000000; ii++) {
            val += val;
            val |= (ii & 1);
        }
        return val;
    }

}

The last figures are the fastest times that were measured.
I did “val |= (ii & 1);” to prevent val from being shifted to zero, and thus to prevent the (server) JVM from doing any clever optimizations with it.

[quote]Well, “<<1 is faster than *2” was your idea to begin with
You’re right of course that it’s almost impossible to measure this accurately and that the results might very well be totally different in real case scenarios, but you’re forgetting that benchmarks are fun
[/quote]
Cool. Yup, I don’t really know why, but I agree that “benches are fun”. Thanks for the bench code, I’ll post my results.

EDIT:

[quote]2515ms : val *= 2;
2515ms : val <<= 1;
[/quote]
and

[quote]1969ms : val *= 2;
1968ms : val <<= 1;
[/quote]
Very curious results; maybe they are both translated into the same opcode.

Earlier in this thread I posted a reference to the function in HotSpot that does this replacement; the question, then, is less whether it can happen than “In what situations does the JIT skip optimizations?”. That’s something I still don’t have much idea about, other than that for very long methods it will stop optimizing at some point. Does anyone know the cutoff, though? I couldn’t find the right debug VM flag to see for myself…

It’s more that methods are inlined, and inlined, and inlined and inlined, and eventually something triggers the end of inlining.

That trigger is way more important than the cut-off in a single method, because nobody in their right mind writes such long methods anyway.

On my 1.8GHz Athlon at home:

-server
3515ms : val *= 2;
3531ms : val <<= 1;
3281ms : val += val;

-client
4578ms : val *= 2;
4593ms : val <<= 1;
5547ms : val += val;

Kind of interesting to see the difference in behaviour of val+=val between -client and -server, both on Intel and AMD… Especially on the client, this isn’t quite ‘noise level’…

I could imagine that the amount of inlining might depend on the CPU/instruction cache size you’re running on?

A better indicator of * vs << would be *8 and << 3 (or so) because of what Cas said.
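
Something like this would do (a quick sketch; the class name is made up and it just reuses the looping trick from the Test class above, so take it as a starting point rather than a definitive benchmark):


public class ShiftVsMul8 {

    public static void main(String[] args) {
        // a few runs so the JIT gets a chance to warm up, as in the Test class above
        for (int run = 0; run < 5; run++) {
            long t = System.currentTimeMillis();
            int r = mul8();
            System.out.println((System.currentTimeMillis() - t) + "ms : val *= 8;  (" + r + ")");

            t = System.currentTimeMillis();
            r = shl3();
            System.out.println((System.currentTimeMillis() - t) + "ms : val <<= 3; (" + r + ")");
        }
    }

    private static int mul8() {
        int val = 1;
        for (int ii = 0; ii < 2000000000; ii++) {
            val *= 8;
            val |= (ii & 1); // keeps val from collapsing to zero
        }
        return val;
    }

    private static int shl3() {
        int val = 1;
        for (int ii = 0; ii < 2000000000; ii++) {
            val <<= 3;
            val |= (ii & 1);
        }
        return val;
    }
}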

If you want to investigate (which is ‘fun’, right!), just monitor the very uninformative output of -XX:+PrintCompilation (or something similar, I bet you guys know). After a while you see the patterns, and how whole methods are removed / optimized away altogether… Now that you think of it, it’s a miracle we still have those perfect stack traces in Java, as the CPU certainly doesn’t have those methods on its stack!
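
For reference (assuming a Sun HotSpot VM and the Test class from earlier), the invocation would be something like:


java -server -XX:+PrintCompilation Test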

About my previous comment on ‘please let’s not waste time’: I’m just getting tired of analyzing JIT performance. It’s not only very dependent on the slightest changes in your source code, it’s also simply non-deterministic!

You can run some benchmark like 8 times (8 times launching the JVM => 8 processes) and get 2 totally different groups of results, like 8s, 8s, 13s, 8s, 13s, 13s, 8s, 8s. So… I’m certainly not writing any more microbenchmarks for a while; only when I can get more than a factor-2 performance gain (in the general case) is it worth messing with your precious time.

Measuring performance differences between float[] and FloatBuffer is even worse! Both have very different cache characteristics! Running the benchmarks on float[432], float[433], float[434] can give VERY different results, which are in turn not comparable to direct FloatBuffers with capacities 432, 433, 434.
In general the calculations/sec behave very much like max(maxPerformance, 1/dataSize + sqrt(dataSize)), with MAJOR spikes down (up to a factor 3 slower) at different intervals of dataSize for float[] and FloatBuffer, where on average float[] is 10-20% faster, depending on dataSize.

So whenever you think your FloatBuffer performance is higher than your float[] performance (by a large margin), you probably just hit a down-spike on your float[]. Try increasing the float[] length by 1 (!), or another CPU, or other memory, or wait for full moon and light some candles, it helps! See? Way too many variables to get any meaningful results in one lifetime.
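
A rough sketch of the kind of size sweep I mean (hypothetical class name, arbitrary pass count; this is just the shape of such a benchmark, not a careful one):


import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.FloatBuffer;

public class ArrayVsBufferSweep {

    public static void main(String[] args) {
        final int passes = 2000000; // arbitrary, just enough to get measurable times
        for (int size = 430; size <= 436; size++) {
            float[] arr = new float[size];
            FloatBuffer buf = ByteBuffer.allocateDirect(size * 4)
                    .order(ByteOrder.nativeOrder()).asFloatBuffer();

            float sum = 0;
            long t = System.currentTimeMillis();
            for (int pass = 0; pass < passes; pass++)
                for (int i = 0; i < size; i++)
                    sum += arr[i] * 0.5f + 1.0f;
            long arrayMs = System.currentTimeMillis() - t;

            t = System.currentTimeMillis();
            for (int pass = 0; pass < passes; pass++)
                for (int i = 0; i < size; i++)
                    sum += buf.get(i) * 0.5f + 1.0f;
            long bufferMs = System.currentTimeMillis() - t;

            // print sum so the JIT can't throw the loops away
            System.out.println("size " + size + ": float[] " + arrayMs + "ms, FloatBuffer "
                    + bufferMs + "ms (" + sum + ")");
        }
    }
}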

Yes, in fact I already tried that (<<10 versus *1024, to test zammbi’s case). Exactly the same result…
Actually, I couldn’t find anything saying that shifting by more bits would make the shift slower.

Hmmm, are you sure it’s not just the PC you’re testing on that has some funky background tasks going on or something? Performance test results are usually quite consistent IME, even though I don’t always understand the results…
I do notice some strange hiccups sometimes, like running a performance test the 2nd time giving different results, but I’m not seeing that very often and I usually just blame my PC and try again… I also found that killing Eclipse can sometimes make benchmark results more consistent if things behave erratically.

[quote]Measuring performance differences between float[] and FloatBuffer is even worse! Both have very different cache characteristics! Running the benchmarks on float[432], float[433], float[434] can give VERY different results, which are in turn not comparable to direct FloatBuffers with capacities 432, 433, 434.
In general the calculations/sec behave very much like max(maxPerformance, 1/dataSize + sqrt(dataSize)), with MAJOR spikes down (up to a factor 3 slower) at different intervals of dataSize for float[] and FloatBuffer, where on average float[] is 10-20% faster, depending on dataSize.
[/quote]
I really don’t understand how a FloatBuffer could be faster than a simple float[]. I mean, you access a float[] directly (bounds checks aside); it is only a memory read, while a FloatBuffer requires a method call that makes things slower, no?

About bounds checks and their slowness (they become a bottleneck when doing computations on a big array): I was thinking that a good addition to the Java language could be something like a foreach statement, so that array access is controlled by the JVM and bounds checks don’t have to be performed; that would be cool…


int b[]=new int[1024*1024];
foreach(int c in b)
{
 c++;
}

:slight_smile:
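
For what it’s worth, Java 5’s enhanced for loop already reads an array like that, though the loop variable is a copy for primitives (so the c++ above wouldn’t actually modify b), and the server JIT can often eliminate bounds checks in simple counted loops anyway. A small fragment:


int[] b = new int[1024 * 1024];
long sum = 0;
for (int c : b) {   // reads every element; c is a copy, so writing to c doesn't touch b
    sum += c;
}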

<<1
Maybe <<1 will stay as fast as or faster than *2, as it is a very simple bit op and very simple bit ops are “hard-wired”? Maybe in some cases, when the JIT has enough time, it replaces *2 with <<1? ==> same results?
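
One thing you can check is what javac itself emits: it does no strength reduction, so val *= 2 still shows up as an iconst_2 / imul pair in the bytecode, and any replacement has to be done by the JIT. Assuming the Test class from earlier, disassemble it with:


javap -c Test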

Well, one other thing to test is whether the results are the same on an AMD, since princec said that the P4 hasn’t got a barrel shifter and you’re using a P4.
But really, I don’t think it’s something to worry about…

I’d tend to agree, anything less than a factor of 2 is not worth worrying greatly about or arguing over, especially in Java. Which brings us back to the original topic of this thread, if you’ll pardon my re-hijacking of the discussion: cache misses, which, if minimized, can result in speed improvements greater than a mere factor of 2. The question was essentially: do people notice the same significant speedups in Java as in C++ from programming cache-friendly algorithms? This is pretty significant, because if not, then it would actually be better to stick with naive random-access algorithms and let the JVM do what it does, rather than re-casting problems and potentially throwing more ops at the processor for negligible (or even negative) speed gain.

The facts as I see them at the moment are:

  1. Nobody here seems to have happened across a situation in the field where a predictable (i.e. caused by your choice of algorithm, not some JVM quirk) cache effect showed a major speedup in a program, otherwise they would have said something. Right? :slight_smile:

  2. The simple interleaved vs. separate array speed tests I ran, flawed as they might have been in Java, did not exhibit anywhere near the 5x speed discrepancy that I saw in C++, and furthermore, all the Java versions beat out the cache-unfriendly C++ version by a significant factor.

So it’s at least plausible (though by no means certain or proven) that in Java, cache optimizations are already largely carried out by the JVM, and this is a less fruitful avenue for optimization than it is in languages that give you direct memory control.

My original purpose was to predict whether it would actually help or not to restructure my physics engine to make contact solving cache friendly - the bottleneck in the engine is almost always the contact solver, which (at least in C++) tends to be dominated by cache misses because it solves islands of bodies based on contact graph connectivity instead of array index, and thus jumps all over the body array, looping several times doing random accesses all over the place. It’s possible to copy away all the relevant variables to an interleaved array at the beginning of the solve and put them in a “nice” order, then iterate, and then copy them back. This is more work for the processor, but usually in C++ it actually goes a lot faster because almost every access hits the cache whereas naively you miss very often.
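
A rough sketch of that gather/solve/scatter idea (the Body fields and the island representation here are made up for illustration, and the dummy update just stands in for the real contact maths):


class Body { float px, py, vx, vy; }

class SolverSketch {

    // Gather the bodies of one island into a tightly packed, interleaved scratch
    // array so the solver iterations walk memory linearly instead of jumping
    // all over the body array.
    static void solveIsland(Body[] bodies, int[] island, int iterations) {
        float[] scratch = new float[island.length * 4];

        // gather: the random access over the body array happens once, up front
        for (int i = 0; i < island.length; i++) {
            Body b = bodies[island[i]];
            int o = i * 4;
            scratch[o]     = b.px;
            scratch[o + 1] = b.py;
            scratch[o + 2] = b.vx;
            scratch[o + 3] = b.vy;
        }

        // solve: every iteration runs over the contiguous scratch array
        for (int it = 0; it < iterations; it++) {
            for (int o = 0; o < scratch.length; o += 4) {
                // dummy update standing in for the real contact solving
                scratch[o]     += scratch[o + 2];
                scratch[o + 1] += scratch[o + 3];
            }
        }

        // scatter: copy the results back to the original bodies
        for (int i = 0; i < island.length; i++) {
            Body b = bodies[island[i]];
            int o = i * 4;
            b.px = scratch[o];
            b.py = scratch[o + 1];
            b.vx = scratch[o + 2];
            b.vy = scratch[o + 3];
        }
    }
}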

I originally thought it would be a lot of work to set up the engine this way, but now it actually looks like it will be fairly simple. So I’ll just try the damn thing and see if it helps or not; that should give a much clearer real-life answer than any micro-benchmark I can come up with, so we’ll see…