Programming to the cache

[quote]I know I’m late coming to this thread, but: Wha…? Since when? Proof?
[/quote]
Hum, why??? This doesn't need proof, it has always been the case; it is true in C, in C++ and probably in every language…

On a PC, a shift by one is a single machine op, most of the time one CPU cycle (if the register is already loaded).
Multiplication requires the second operand to be loaded, and the multiply instruction usually takes more than one CPU cycle.

On a PC a shift compiles to something like this:

MOV EAX,value
SHL EAX,1 (one CPU cycle)

and a multiplication to something like this:

MOV EAX,value
MOV EBX,2
MUL EBX (2-3 CPU cycles)

but:

[quote](as according to the JIT sourcecode) POT multiplications (of constants) are indeed replaced by bit-shifting
[/quote]
Also, from what I understand SHL doesn’t execute in 1 cycle but rather in 6, which is due to the removal of the barrel shifter Cas talked about (it used to execute in 1 cycle on 486).

Anyway, in asm the fastest way to multiply by 2 is probably a simple addition (well, it used to be the fastest way in the old days, but it has been a long time since I last did asm).

AMD seems to still have a barrel shifter. Or at least the one I have at home does.

[quote]Anyway, in asm the fastest way to multiply by 2 is probably a simple addition (well, it used to be the fastest way in the old days, but it has been a long time since I last did asm).
[/quote]
So, three methods; anybody motivated to benchmark them?

n<<=1
n*=2
n+=n

Microbenchmarks are nearly impossible in Java.
In real-life cases, performance is already impossible to predict.
Further, it'll depend on CPU, OS, and JVM.

Let's not waste time on it.

Well, “<<1 is faster than *2” was your idea to begin with :wink:
You’re right of course that it’s almost impossible to measure this accurately and that the results might very well be totally different in real case scenarios, but you’re forgetting that benchmarks are fun :slight_smile:

On my machine at work (Dual core P4 @3.2GHz, 1GB, Java 1.6.0_03-b05):
-client
2515ms : val *= 2;
2515ms : val <<= 1;
3765ms : val += val;

-server
1969ms : val *= 2;
1968ms : val <<= 1;
1891ms : val += val;

For what it’s worth (if anything):


public class Test {

    public static void main(String[] args) {
        int vResult = 0;
        long ms1 = Long.MAX_VALUE;
        long ms2 = Long.MAX_VALUE;
        long ms3 = Long.MAX_VALUE;

        for (int i = 0; i < 10; i++) {
            long start = System.currentTimeMillis();
            vResult += test1();
            long ms = System.currentTimeMillis() - start;
            if (ms < ms1)
                ms1 = ms;
            System.out.println(ms + "ms : val *= 2;");

            start = System.currentTimeMillis();
            vResult += test2();
            ms = System.currentTimeMillis() - start;
            if (ms < ms2)
                ms2 = ms;
            System.out.println(ms + "ms : val <<= 1;");

            start = System.currentTimeMillis();
            vResult += test3();
            ms = System.currentTimeMillis() - start;
            if (ms < ms3)
                ms3 = ms;
            System.out.println(ms + "ms : val += val;");
            System.out.println(vResult);
        }
        System.out.println("--------------------------------------------------------");
        System.out.println(ms1 + "ms : val *= 2;");
        System.out.println(ms2 + "ms : val <<= 1;");
        System.out.println(ms3 + "ms : val += val;");

    }

    /**
     * Doubles val with a multiply in a tight loop.
     * @return an accumulated value, returned so the loop can't be optimized away
     */
    private static int test1() {
        int val = 1;
        for (int ii = 0; ii < 2000000000; ii++) {
            val *= 2;
            val |= (ii & 1);
        }
        return val;
    }

    /**
     * Doubles val with a left shift in a tight loop.
     * @return an accumulated value, returned so the loop can't be optimized away
     */
    private static int test2() {
        int val = 1;
        for (int ii = 0; ii < 2000000000; ii++) {
            val <<= 1;
            val |= (ii & 1);
        }
        return val;
    }

    /**
     * Doubles val by adding it to itself in a tight loop.
     * @return an accumulated value, returned so the loop can't be optimized away
     */
    private static int test3() {
        int val = 1;
        for (int ii = 0; ii < 2000000000; ii++) {
            val += val;
            val |= (ii & 1);
        }
        return val;
    }

}

The last figures are the fastest times that were measured.
I added "val |= (ii & 1);" to prevent val from being shifted down to zero, and thus to keep the (server) JVM from optimizing the loop away.

[quote]Well, “<<1 is faster than *2” was your idea to begin with
You’re right of course that it’s almost impossible to measure this accurately and that the results might very well be totally different in real case scenarios, but you’re forgetting that benchmarks are fun
[/quote]
Cool. Yup, I don't really know why, but I agree that "benchmarks are fun". Thanks for the bench code, I'll post my results.

EDIT:

[quote]2515ms : val *= 2;
2515ms : val <<= 1;
[/quote]
and

[quote]1969ms : val *= 2;
1968ms : val <<= 1;
[/quote]
Very curious results; maybe they are both translated into the same opcode.

Earlier in this thread I posted a reference to the function in HotSpot that does this replacement; the question, then, is less whether it can happen than "In what situations does the JIT skip optimizations?". That's something I still don't have much idea about, other than that for very long methods it will stop optimizing at some point. Does anyone know the cutoff, though? I couldn't find the right debug VM flag to see for myself…

It’s more that methods are inlined, and inlined, and inlined and inlined, and eventually something triggers the end of inlining.

That trigger is way more important than the cut-off in a single method, because nobody in their right mind writes such long methods anyway.
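To make that concrete, here's a rough sketch (the class and method names are made up; the flags in the comments are standard HotSpot options, but their defaults differ between VM versions) of the kind of tiny call chain the JIT keeps folding together until one of its inlining limits stops it:

// A chain of trivially small methods: scale() is inlined into doubleIt(),
// which is inlined into the loop body, until a size/depth limit is hit.
// HotSpot exposes knobs such as -XX:MaxInlineSize and -XX:FreqInlineSize,
// and -XX:+PrintCompilation shows which methods get compiled.
public class InlineSketch {

    private static int doubleIt(int x) {
        return scale(x, 1);          // trivially small: a prime inlining candidate
    }

    private static int scale(int x, int shift) {
        return x << shift;           // should end up as a single shift after inlining
    }

    public static void main(String[] args) {
        int val = 1;
        for (int i = 0; i < 2000000000; i++) {
            val = doubleIt(val);
            val |= (i & 1);          // same trick as above to defeat dead-code elimination
        }
        System.out.println(val);
    }
}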

On my 1.8GHz Athlon at home:

-server
3515ms : val *= 2;
3531ms : val <<= 1;
3281ms : val += val;

-client
4578ms : val *= 2;
4593ms : val <<= 1;
5547ms : val += val;

Kind of interesting to see the difference in behaviour of val+=val between -client and -server, both on Intel and AMD… Especially on the client, this isn’t quite ‘noise level’…

I could imagine that the amount of inlining might depend on the CPU/instruction cache size you’re running on?

A better indicator of * vs << would be *8 and << 3 (or so) because of what Cas said.
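Something like this, dropped into the Test class above (an untested sketch; the method names testMul8 and testShl3 are mine):

    private static int testMul8() {
        int val = 1;
        for (int ii = 0; ii < 2000000000; ii++) {
            val *= 8;
            val |= (ii & 1); // same trick as above to keep the loop from being optimized away
        }
        return val;
    }

    private static int testShl3() {
        int val = 1;
        for (int ii = 0; ii < 2000000000; ii++) {
            val <<= 3;
            val |= (ii & 1);
        }
        return val;
    }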

If you want to investigate (which is 'fun', right!), just monitor the very uninformative output of -XX:+PrintCompilation. After a while you see the patterns, and how whole methods are removed / optimized away altogether… Now that you think of it, it's a miracle we still have those perfect stack traces in Java, as the CPU certainly doesn't have those methods on its stack!

About my previous comment on 'please let's not waste time': I'm just getting tired of analyzing JIT performance. It's not only very dependent on the slightest changes in your source code, it's also simply non-deterministic!

You can run a benchmark like 8 times (launching the JVM 8 times => 8 processes) and get 2 totally different groups of results, like 8s, 8s, 13s, 8s, 13s, 13s, 8s, 8s. So… I'm certainly not writing any more microbenchmarks for a while; only when I can get more than a factor-2 performance gain (in the general case) is it worth messing with your precious time.

Measuring performance differences between float[] and FloatBuffer is even worse! Both have very different cache characteristics! Running the benchmarks on float[432], float[433], float[434] can give VERY different results, which are in turn not comparable to direct FloatBuffers with capacities 432, 433, 434.
In general the calculations/sec behave very much like max(maxPerformance, 1/dataSize + sqrt(dataSize)), with MAJOR downward spikes (up to a factor of 3 slower) at different intervals of dataSize for float[] and FloatBuffer, where on average float[] is 10-20% faster, depending on dataSize.

So whenever you think your FloatBuffer performance beats your float[] performance by a large margin, you probably just hit a down-spike in your float[] results. Try increasing the float[] length by 1 (!), or another CPU, or other memory, or wait for a full moon and light some candles; it helps! See? Way too many variables to get any meaningful results in one lifetime.
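For reference, this is roughly the shape of the comparison I mean; a minimal sketch, with arbitrary sizes and a trivial summing kernel:

import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.FloatBuffer;

public class ArrayVsBuffer {

    public static void main(String[] args) {
        int size = 433;                       // try 432, 433, 434... results can swing wildly
        float[] array = new float[size];
        FloatBuffer buffer = ByteBuffer.allocateDirect(size * 4)
                .order(ByteOrder.nativeOrder()).asFloatBuffer();

        long t0 = System.currentTimeMillis();
        float a = 0f;
        for (int run = 0; run < 100000; run++) {
            for (int i = 0; i < array.length; i++) {
                a += array[i];                // plain array read
            }
        }
        System.out.println("float[]     : " + (System.currentTimeMillis() - t0) + "ms (" + a + ")");

        long t1 = System.currentTimeMillis();
        float b = 0f;
        for (int run = 0; run < 100000; run++) {
            for (int i = 0; i < buffer.capacity(); i++) {
                b += buffer.get(i);           // absolute get() on a direct, natively ordered buffer
            }
        }
        System.out.println("FloatBuffer : " + (System.currentTimeMillis() - t1) + "ms (" + b + ")");
    }
}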

Yes, in fact I already tried that (<<10 versus *1024 to test zammbi’s case). Exactly the same result…
Actually, I couldn't find anything suggesting that shifting by more bits would make the shift slower.

Hmmm, are you sure it’s not just the PC you’re testing on that has some funky background tasks going on or something? Performance test results are usually quite consistent IME, even though I don’t always understand the results…
I do notice some strange hiccups sometimes, like running a performance test a second time giving different results, but I'm not seeing that very often and I usually just blame my PC and try again… I also found that killing Eclipse can sometimes make benchmark results more consistent if things behave erratically.

[quote]Measuring performance differences between float[] and FloatBuffer is even worse! Both have very different cache characteristics! Running the benchmarks on float[432], float[433], float[434] can give VERY different results, which are in turn not comparable to direct FloatBuffers with capacities 432, 433, 434.
In general the calculations/sec behave very much like max(maxPerformance, 1/dataSize + sqrt(dataSize)), with MAJOR downward spikes (up to a factor of 3 slower) at different intervals of dataSize for float[] and FloatBuffer, where on average float[] is 10-20% faster, depending on dataSize.
[/quote]
I really don't understand how a FloatBuffer could be faster than a simple float[]?? I mean, you access a float[] directly (apart from bounds checks); it is only a memory read, while a FloatBuffer requires a method call that makes things slower, no?

About bounds checks and their slowness (a bottleneck when doing computations on a big array): I was thinking that a good addition to the Java language could be something like a foreach statement. That way, array access would be controlled by the JVM and bounds checks would not have to be performed. That would be cool…


int[] b = new int[1024 * 1024];
foreach (int c in b)
{
    c++;
}

:slight_smile:

<<1
Maybe <<1 will stay as fast as or faster than *2, since it is a very simple bit op and very simple bit ops are "hard-wired"? Maybe in some cases, when the JIT has enough time, it replaces *2 with <<1? ==> same results?

Well, one other thing is to test whether the results are the same on an AMD, since princec said that the P4 hasn't got a barrel shifter and you're using a P4.
But really I don't think it's something to worry about…

I’d tend to agree, anything less than a factor of 2 is not worth worrying greatly about or arguing over, especially in Java. Which brings us back to the original topic of this thread, if you’ll pardon my re-hijacking of the discussion: cache misses, which if minimized can result in greater speed improvements than a mere factor of 2. The question was essentially, do people notice the same significant speedups in Java as in C++ due to programming cache friendly algorithms? This is pretty significant, because if not, then it would actually be better to stick with naive random access algorithms and let the JVM do what it does rather than re-casting problems and potentially throwing more ops at the processor for negligible (or even negative) speed gain.

The facts as I see them at the moment are:

  1. Nobody here seems to have happened across a situation in the field where a predictable (i.e. caused by your choice of algorithm, not some JVM quirk) cache effect showed a major speedup in a program, otherwise they would have said something. Right? :slight_smile:

  2. The simple interleaved vs. separate array speed tests I ran, flawed as they might have been in Java, did not exhibit anywhere near the 5x speed discrepancy that I saw in C++, and furthermore, all the Java versions beat out the cache-unfriendly C++ version by a significant factor.

So it’s at least plausible (though by no means certain or proven) that in Java, cache optimizations are already largely carried out by the JVM, and this is a less fruitful avenue for optimization than it is in languages that give you direct memory control.

My original purpose was to predict whether it would actually help or not to restructure my physics engine to make contact solving cache friendly - the bottleneck in the engine is almost always the contact solver, which (at least in C++) tends to be dominated by cache misses because it solves islands of bodies based on contact graph connectivity instead of array index, and thus jumps all over the body array, looping several times doing random accesses all over the place. It’s possible to copy away all the relevant variables to an interleaved array at the beginning of the solve and put them in a “nice” order, then iterate, and then copy them back. This is more work for the processor, but usually in C++ it actually goes a lot faster because almost every access hits the cache whereas naively you miss very often.
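Roughly what I have in mind, as a sketch only; the Body fields and the three-floats-per-body layout are made up for illustration:

// Gather the solver-relevant state of an island's bodies into one interleaved
// array in contact order, iterate over that, then scatter the results back.
class Body {
    float vx, vy, invMass;
}

class SolverScratch {
    // 3 floats per body: [vx, vy, invMass, vx, vy, invMass, ...]
    private float[] data;

    void gather(Body[] island) {
        if (data == null || data.length < island.length * 3) {
            data = new float[island.length * 3];
        }
        for (int i = 0; i < island.length; i++) {
            Body b = island[i];
            data[i * 3]     = b.vx;
            data[i * 3 + 1] = b.vy;
            data[i * 3 + 2] = b.invMass;
        }
    }

    void scatter(Body[] island) {
        for (int i = 0; i < island.length; i++) {
            Body b = island[i];
            b.vx = data[i * 3];
            b.vy = data[i * 3 + 1];
            // invMass is read-only during the solve, so there is no need to copy it back
        }
    }
}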

I originally thought it would be a lot of work to set up the engine this way, but now it actually looks like it will be fairly simple. So I'll just try the damn thing and see if it helps or not; that should give a much clearer real-life answer than any micro-benchmark I can come up with, so we'll see…

No, the JVM does some very smart optimisations on natively ordered direct buffers.

Simple loops and slightly more complex loops have not suffered from bound-checks for ages now. Make a benchmark in C, and you’ll get exactly the same performance as in Java, as long as SIMD is not enabled.
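For the simple case, this is the canonical loop shape where HotSpot can prove every access is in bounds and drop the checks (the method itself is just an illustration):

    static float sum(float[] data) {
        float total = 0f;
        // the index runs from 0 to data.length, so the JIT can hoist/eliminate the range checks
        for (int i = 0; i < data.length; i++) {
            total += data[i];
        }
        return total;
    }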

I’ve just tried it with JBullet. It supports both cache friendly and unfriendly accesses, here are results:

  • client VM: cache friendly: ~30 fps, unfriendly: ~22 fps
  • server VM: cache friendly: ~50 fps, unfriendly: ~42 fps

I’ve used BasicDemo which has stack of 125 boxes. To test yourself, add this code after creating DiscreteDynamicsWorld:


dynamicsWorld.getSolverInfo().solverMode &= ~SolverMode.SOLVER_CACHE_FRIENDLY;

Also remember to press the 'd' key to disable deactivation when testing.

There is a for-each statement; the syntax looks like


int[] b = new int[1024 * 1024];
for (int c : b)
{
    c++;
}

but it’s no use for altering primitive types as in the example because of the lack of pointers to primitives. I’d have assumed everyone here knew about this, but I’m slightly surprised you wrote what you did in that case.
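For completeness, if you actually want to modify the array in place you need the plain indexed form:

int[] b = new int[1024 * 1024];
for (int i = 0; i < b.length; i++) {
    b[i]++;   // writes back to the array, unlike the for-each copy of each element
}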

Remember DzzD writes for Java 1.1 IIRC.

Everybody else knew :stuck_out_tongue:

Hehe, true, I am not really a friend of new features and shortcuts… Basically I will never learn the full Java API spec, just like in all other languages; I am not a Java coder, just a coder!

[quote]…but it’s no use for altering primitive types as in the example because of the lack of pointers to primitives…
[/quote]
I am afraid that new graduates will know all the high-level features without ever learning the basics. People say it is good for design and so on… but I think that is the main reason new software requires so much memory and CPU, even when it could have been made to run faster on "smaller configuration" computers. Not sure, but the sentence above makes me think about that a little… Can you explain what you mean exactly? How do you think Java modifies an int? What difference do you see between c[x] and *c= ?

[quote]Simple loops and slightly more complex loops have not suffered from bound-checks for ages now. Make a benchmark in C, and you’ll get exactly the same performance as in Java, as long as SIMD is not enabled.
[/quote]
Nope, Java bounds checks are a bottleneck. Perform any reasonable benchmark and the results will show that random Java array access is at least two times (up to 5 times for random access) slower than direct memory access in another language such as C, or even in Basic… I will try to take some time and make a benchmark between ASM/C/C++/Java on PC (the Java side would look something like the sketch below), and I can already give you the results:
1 - ASM
2 - C/C++
3 - Java
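A minimal sketch of the Java side of such a random-access test (the sizes, seed and access pattern are arbitrary; the other languages would loop over the same precomputed indices):

import java.util.Random;

public class RandomAccessBench {

    public static void main(String[] args) {
        int size = 1024 * 1024;
        int[] data = new int[size];
        int[] indices = new int[size];
        Random rnd = new Random(12345);      // fixed seed so every run/language uses the same pattern
        for (int i = 0; i < size; i++) {
            indices[i] = rnd.nextInt(size);  // precompute the random indices so the RNG isn't measured
        }

        long sum = 0;
        long start = System.currentTimeMillis();
        for (int run = 0; run < 100; run++) {
            for (int i = 0; i < size; i++) {
                sum += data[indices[i]];     // random access: bounds check plus a likely cache miss
            }
        }
        System.out.println("random access: " + (System.currentTimeMillis() - start) + "ms (" + sum + ")");
    }
}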