Programming to the cache

[quote]

[quote]also x*2 is much slower than x<<1
[/quote]
Again, not on today’s VMs, especially if the 2 is written as a compile-time constant. This is one of the easiest optimizations for a compiler to make, so I would be shocked if the JVM wasn’t doing it. If I’m correct, it’s even done in most C/C++ compilers these days, though I haven’t checked that.
[/quote]
No no no no! (no. :)) Indeed C compilers have done this forever, but the JVM never has.
If you don’t see a difference, that means your bottleneck (like memory I/O) is somewhere else.

You might have read the thread about MappedObjects I created yesterday. I added code to handle the case where the stride is a power of two, replacing a single *16 by a <<4 (or *32 by <<5 for that matter), and saw the performance double. That was because the bottleneck was actually there. As you saw, in another situation I could squeeze in a ‘massive’ swap64 method without a speed penalty. So if you’re not seeing your optimisation pay off, that doesn’t mean your code hasn’t become faster, just that the CPU is waiting on something else. Once you remove the other bottleneck, you’ll notice that your earlier optimisation did speed up your code.
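
To illustrate the kind of replacement I mean, here is a minimal sketch with made-up names (STRIDE, STRIDE_SHIFT) - it is not the actual MappedObjects code:

// Strength reduction on a power-of-two stride; names are made up for illustration.
private static final int STRIDE = 16;       // bytes per element, a power of two
private static final int STRIDE_SHIFT = 4;  // log2(STRIDE)

int offsetMul(int index) {
    return index * STRIDE;        // general case: integer multiply
}

int offsetShift(int index) {
    return index << STRIDE_SHIFT; // power-of-two case: a single shift
}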

The VM is really not that smart, even though it’s trivial to replace a constant multiplication by a bitshift.

I’ll do some actual tests with your benchmark (yesterday I didn’t have the time). There are so many more tricks to make that code faster (that’s why I said it was rather poor). Your main slowdown is in the loop overhead. AFAIK the JVM does some loop unrolling, but doing it yourself (copying the body 2-4 times inside the loop) is much faster.
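
Roughly this is what I mean by unrolling it yourself (just a sketch; it assumes the length is a multiple of 4, otherwise you need a cleanup loop for the tail):

// Manual 4x unrolling sketch; assumes src.length == dest.length and
// that the length is a multiple of 4 (otherwise add a cleanup loop for the remainder).
void copyUnrolled(int[] src, int[] dest) {
    for (int i = 0; i < src.length; i += 4) {
        dest[i]     = src[i];
        dest[i + 1] = src[i + 1];
        dest[i + 2] = src[i + 2];
        dest[i + 3] = src[i + 3];
    }
}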

Again, it’s hard to do micro-benchmarking, not only because the JVM rewrites most of your code, but also because you might think the bottleneck is somewhere else, or may even be unaware of certain types of bottlenecks.

Lastly, you have that giant method that runs all the benchmarks. The JIT stops optimising methods that are longer than a certain number of instructions. That ‘magic amount’ is rather low, and your method will certainly not be turned into the most efficient native code. You can take a look at the output of the debug VMs (release candidates with debug command-line options, IIRC) from Sun, which will tell you when the JIT gives up.

I’ll post some tests later today.

arf, just noticed that most of your code is unoptimized, just an example below:

private void loopOptimized() {
    int limit = source.length >> 1;
    for (int i = 0; i < limit; ++i) {
        dest[(i << 1)] = someFunctionOptimized(source[(i << 1) + 1]);
        dest[(i << 1) + 1] = someFunctionOptimized(source[(i << 1)]);
    }
}

for example, can be replaced by

private void loopOptimized() {
    int limit = source.length >> 1;
    for (int i = 0; i < limit; i += 2) { // I heard that i++ is not optimized (not translated to an ASM inc)
        int i1 = i + 1;
        dest[i] = someFunctionOptimized(source[i1]);
        dest[i1] = someFunctionOptimized(source[i]);
    }
}

It is always good to factor out repetitive patterns.

or

for (int i=0; i<particles; ++i) 
{
 float x = i*i*.01f;
 float vx = i*.1f;
 floats[2*i+1] = vx;
 floats[2*i] = x;
 floatsx[i] = x;
 floatsvx[i] = vx;
 if (firstTime) particleList[i] = new Particle(); 
 particleList[i].setX(x);
 particleList[i].setVx(vx);
}

replaced by:


Particle p;
for (int i = 0; i < particles; i++) {
    particleList[i] = new Particle(); // or use an object pool...
}

for (int i = 0; i < particles; i++) {
    int i2 = i << 1;
    float x = i * i * 0.01f;
    float vx = i * 0.1f;
    floats[i2 + 1] = vx;
    floats[i2] = x;
    floatsx[i] = x;
    floatsvx[i] = vx;
    p = particleList[i];
    p.setX(x);
    p.setVx(vx);
}

Riven - Yes yes yes yes yes. (No, actually, I’m just kidding, I know enough to know that I’m probably wrong here. :slight_smile: )

That all makes sense (er, well, let me rephrase: let’s just say that your explanation makes sense - the behavior itself is %^$&@ maddening, that simple local optimizations like that never happen anywhere in the compilation process when they would seem so straightforward to implement at literally any point along the path from .java text to native code, and I’m further surprised to see firsthand that manual unrolling of trivial loops actually helps things in any significant manner, but you’re right, it absolutely does! End rant.). I suppose I’m still too often confusing the “should be” with the “is” when it comes to the JVM (a habit I must disabuse myself of yet again), and unfortunately, as you mentioned, it’s tricky to accurately measure these things, so a lot of the time there’s confirmation bias at work.

I wrote a longer response before with some more questions, but just had a computer crash so I lost it, and I’m about to go to sleep, so I’ll hit the main point. Please, don’t waste too much effort optimizing that code I posted before, it was just a quick stab, not anything I actually need, especially since I pretty much already have the conclusion I wanted from it (major speed boosts based on simple cache considerations when picking data representation are not quite as automatic in Java as in C++, so more detailed examination is necessary).

To me it would actually be most useful to see a simple example exhibiting, for instance, a well-enough optimized loop that I could see for myself an effect like the speed difference between a multiply and a shift - while I definitely trust you on that, I tried briefly and wasn’t able to make the difference visible, despite a very no-nonsense loop, unrolled a bit, with no other obvious bottlenecks. But it’s becoming clear to me that a lot of my conceptions about what causes stalls have to shift considerably, especially in Java, and a lot of things that I would have thought would hurt performance may actually help it, so it would be great if you could point me to a pared-down example that I could take as a baseline and work from.

BTW, do you know of any good references on this stuff that are somewhat recent? Googling seems to pull up mostly either ancient optimization guides from the dot-com era or feel-good assurances from “experts” insisting that we don’t need to optimize Java code because the next version of the VM will do everything, including our taxes, for us anyway…

@DzzD: yeah, a lot of that could be played with quite a bit, however what I was intending to test there was whether the sole difference between a multiply and a shift was visible with all else remaining the same, so that was mostly just contrived to use a bunch of shifts for no purpose but to use them; turns out that I sort of misinterpreted what Riven was saying anyways, which renders the example moot.

I always feel vaguely bad every time I pull out subexpressions like that in Java, though, as if it’s a) wasted effort, and b) perhaps doing more damage than good - am I again just feeling unreasonable pangs of guilt based on what Some Guy At Sun once said about how I should let the JVM do that type of work for me, or do those types of optimizations actually happen inside HotSpot? I mean, this is compilers 101 stuff, but I suppose without verification I should not assume much.

i++ not optimized? Hm, can anyone confirm that? ++i and i++ should definitely be translated differently from each other, but it would surprise me to learn that they are not each being converted to their best representations in bytecode.

(also, careful - the new code doesn’t do the same thing anymore, you’d need to pull the shift from the limit declaration)
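
i.e. roughly this, assuming an even length, as the original version did:

private void loopOptimized() {
    int limit = source.length;           // full length, no >>1 here...
    for (int i = 0; i < limit; i += 2) { // ...because we now step by 2
        int i1 = i + 1;
        dest[i] = someFunctionOptimized(source[i1]);
        dest[i1] = someFunctionOptimized(source[i]);
    }
}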

[quote]@DzzD: yeah, a lot of that could be played with quite a bit, however what I was intending to test there was whether the sole difference between a multiply and a shift was visible with all else remaining the same, so that was mostly just contrived to use a bunch of shifts for no purpose but to use them; turns out that I sort of misinterpreted what Riven was saying anyways, which renders the example moot.
[/quote]
ok, so the problem in your bench is that *2 or <<1 uses so little CPU that the final bench result won’t be affected, as most of the time will be spent in the array[i] read/write ops.

<<1 is faster than *2; this is a fact in any language, and it comes down to the ASM/machine op codes. e.g. *2 requires loading two registers, performing a mul, and sometimes storing the result back.

But let’s say <<1 takes about 2 CPU cycles and *2 maybe about 5, while all the array[i] + for + other stuff takes maybe 200 cycles.

In one case your full loop will spend 202 cycles and in the other 205, so you won’t see any difference in your results.
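
A rough sketch of a loop where the multiply/shift is essentially the only work, so a difference at least has a chance to show up (all the usual micro-benchmark caveats apply: warm up the JIT, consume the results, and the loop overhead itself can still hide the gap unless you unroll a bit):

// Everything stays in registers; no array access to dominate the timing.
long sumMul(int n) {
    long acc = 0;
    for (int i = 0; i < n; i++) {
        acc += i * 16;   // constant power-of-two multiply, candidate for strength reduction
    }
    return acc;
}

long sumShift(int n) {
    long acc = 0;
    for (int i = 0; i < n; i++) {
        acc += i << 4;   // hand-written shift
    }
    return acc;
}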

[quote] I always feel vaguely bad every time I pull out subexpressions like that in Java, though, as if it’s a) wasted effort, and b) perhaps doing more damage than good - am I again just feeling unreasonable pangs of guilt based on what Some Guy At Sun once said about how I should let the JVM do that type of work for me, or do those types of optimizations actually happen inside HotSpot? I mean, this is compilers 101 stuff, but I suppose without verification I should not assume much.
[/quote]
:o hum… factoring it out makes the code both optimised and cleaner (more readable); this is good practice!! Just imagine that, for whatever reason, you have to change i+1 to i+2…

[quote]Some Guy At Sun once said about how I should let the JVM do that type of work
[/quote]
I don’t think they were thinking of things like the following, which is absolutely awful code:
y = x*2;
z = x*2;
w = x*2;
t = t + x*2;

[quote]- final keyword makes no difference at all to performance anymore, anywhere, at any time. Only use it when it “should” be used, not because you think it will make things faster.
[/quote]
When a small method is declared final, I guess the JIT can inline it; declaring two methods final in one of my demos made the FPS grow from 27 to 31.
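
For reference, the kind of small accessor I mean (whether final actually changes anything depends on the VM; a modern HotSpot can usually inline small monomorphic methods either way):

// Tiny accessors like these are typical inlining candidates; on some VMs
// marking them (or the class) final may make the JIT's job easier.
public final float getX()  { return x; }
public final float getVx() { return vx; }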

(We’re now thoroughly off topic, which is fine by me, this is all interesting stuff - just wanted to put a disclaimer that nothing in this post relates to cache friendly programming anymore, this is 100% in the nitpicky micro-optimization category!)

[quote]<<1 is faster than *2; this is a fact in any language, and it comes down to the ASM/machine op codes. e.g. *2 requires loading two registers, performing a mul, and sometimes storing the result back.
[/quote]
Right, and I would never make the claim that the actual multiplication operation was as fast as a shift; I’m well aware of the differences in the operations once we’re down to bare metal. But in most programming languages there actually is no difference in the assembly generated when you type in n*2 as opposed to (n<<1), they are both turned into left shifts; if that’s happening, then it’s generally better practice to just write what you mean (esp. given how easy it is to screw up the precedence rules with shifts, for instance by accidentally writing (x<<1 + 1) to replace (2*x + 1)). I was simply surprised by the claim that the JVM would let this type of thing slide by, as it’s a textbook example of an easy-to-implement peephole optimization.

After another failed stab at making a performance difference visible, I took a dig through the HotSpot 1.6.3 source to try and see exactly what’s going on and why this might have been left out (it’s all quite interesting, I’m absolutely going to dig through more of it to see what’s happening under the hood), and I see that it definitely does have code that can recognize an integer multiply by a constant power of two and transform it to a left shift (see share->vm->opto->multnode.cpp at line 145 in MulINode::Ideal(PhaseGVN *phase, bool can_reshape)). But I don’t know enough about the internals to immediately see whether it’s live or not, or whether it’s perhaps a low priority optimization that is usually never reached, and don’t have the time right now to get this beast compiling and throw in some traces to see whether that code is getting hit. Maybe if I have some extra time this week.

Riven, is it possible that the performance improvements you’re seeing as a result of those replacements are happening in functions that for some reason or another are getting skipped over during optimization, perhaps for method length reasons like you mentioned before? I totally expect you to prove me wrong on this again, by the way, but I can’t help but ask now that I’m actually looking at the source that’s supposed to do this transform!

Yes, that code is bad bad bad from a maintenance and readability perspective. But I’m not sure off the top of my head whether it will ultimately be any slower than if you eliminated the extra multiplies, as some compilers will recognize that they can simplify it and will generate very good assembly from that code, especially in such a simple situation. I don’t know how well HotSpot would do at it, though I’d be curious to know, because most of the time when I omit common subexpression eliminations it’s because the algorithm is such that the change would actually reduce the readability or logic of the code, so there is a bit of a real tradeoff. This happens a lot with math expressions, where you are carrying out a tricky geometric procedure or something like that, and for clarity it’s best to preserve the geometric objects’ locality in the code rather than slicing and dicing to reduce the operation count.

[quote]there actually is no difference in the assembly generated when you type in n*2 as opposed to (n<<1), they are both turned into left shifts;
[/quote]
arf, as it is not 100% true, it’s better to get into the habit of using <<1 or <<2 etc… you will end up using it automatically, and after several years of thinking in binary you won’t see any difference between *2 and <<1 when reading code.

@DzzD: I think we’re fundamentally in agreement here about the speed issue, we just differ a little bit as far as style goes. :slight_smile:

Oh, I was curious, so I looked into the increment issue you mentioned before, and I don’t think you need to worry about it, if I understand your concern correctly - both ++i and i += 1 are translated to the same bytecode (IINC 1 1, ILOAD 1), and i++ turns into (ILOAD 1, IINC 1 1), as expected. Though while we’re on the topic, it’s worth mentioning that 32767 is the cutoff for the IINC operator: i += 32768 translates to (ILOAD 1, LDC 32768, IADD, ISTORE 1, ILOAD 1), which could potentially be slower. Not that anyone’s usually “incrementing” by numbers that big, but…
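
If anyone wants to verify this on their own setup, javap shows the bytecode; something like this (the class and method names are just placeholders):

// Compile and disassemble to inspect the increments yourself:
//   javac Inc.java
//   javap -c Inc
public class Inc {
    static int preInc(int i)  { ++i; return i; }        // a single iinc
    static int postInc(int i) { i++; return i; }        // as a standalone statement, same as ++i
    static int bigAdd(int i)  { i += 32768; return i; } // too big for iinc: load/add/store instead
}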

This is quite interesting.
It makes me wonder though why java’s bounds checks seem as expensive as they are. Especially with lots of array access, it should basically be free then? It isn’t though.

[quote]@DzzD: I think we’re fundamentally in agreement here about the speed issue, we just differ a little bit as far as style goes.
[/quote]
oki :wink:

[quote]This is quite interesting.
It makes me wonder though why java’s bounds checks seem as expensive as they are. Especially with lots of array access, it should basically be free then? It isn’t though.
[/quote]
This slow array access is, for me, the biggest performance issue in Java.

absolutely!

I would agree with you: the loop counter and array bound are usually both going to be in registers, which should allow the check to execute while waiting for the array element to come through from the cache or memory; however, the range check always happens first, so even if the access is slower, it might not be initiated until the range check is already finished, which would be a slight waste of time. I’m not too familiar with how this out-of-order stuff happens in hardware, so I could be wrong. With decent branch prediction, I’d at least think the next bounds check should be executed while waiting for the previous access to finish…

Are we sure that the slowness is actually due to the bounds checks, though? See http://weblogs.java.net/blog/kohsuke/archive/2008/03/deep_dive_into.html - the bounds checks are actually being correctly eliminated, at least in some cases, for the “fast” part of the loop (the JVM splits most loops of reasonable and known length into a short pre-phase, a longer fast phase, and a short post-phase so that it can safely unroll - 8x in that example - and align things nicely, as well as trim out the bounds check for the fast phase), though the example there is a bit trivial for comfort. But with constant bounds, which the JVM does check for, it shouldn’t be difficult to eliminate most bounds checks in the same manner. In that example only 16 checks are performed out of a possible 256, which should add a negligible amount of time to the loop, and on a quick inspection the fast loop doesn’t look much different from what you’d expect from a C compiler. The JVM could certainly do better - in that example there doesn’t need to be a single bounds check - but 16 out of 256 isn’t that bad.

Anyone up to a compare and contrast between the assembly generated from a noticeably slow Java array access loop and a fast C one so we can see exactly what the difference is?

It might also be worth pinning down what kind of code will cause the bounds check to happen every time through the loop, because that’s something you’d definitely want to avoid.
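
Something like this pair might be a useful starting point (just a sketch; the indices array is assumed to be pre-filled with valid positions):

// Sequential access: the JIT can hoist/eliminate the bounds check for the
// main part of the loop (the pre/fast/post split described above).
long sumSequential(int[] data) {
    long sum = 0;
    for (int i = 0; i < data.length; i++) {
        sum += data[i];
    }
    return sum;
}

// Random access: here the JIT generally cannot prove data[indices[i]] is in
// range, so a bounds check on data is likely to remain on every iteration.
long sumRandom(int[] data, int[] indices) {
    long sum = 0;
    for (int i = 0; i < indices.length; i++) {
        sum += data[indices[i]];
    }
    return sum;
}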

[quote]however, the range check always happens first, so even if the access is slower, it might not be initiated until the range check is already finished, which would be a slight waste of time.
[/quote]
True, however in tight inner loops that’s only really true at the first iteration.

I know that the JVM is capable of eliminating bounds checks in some trivial cases; I’m mainly referring to random array access, where bounds checks always happen.
I haven’t directly compared against C, but what I’ve seen is that Java’s speed can be comparable to C until you do lots of random array access. What I did try is disabling bounds checks in GCJ for a random-access-heavy program (an emulator), which sped the whole thing up by a really large amount (more than 2x IIRC).

I would definitely agree with ewjordan on this one - your source-code should be as verbose as possible.
If you are performing a bit-wise operation, use the shift operators - if you are doing a division or multiplication use the intended math operators.

If you want the solution to be performant and you don’t trust the VM runtime to make the obvious optimisation, then you should pass your classes through an appropriate bytecode optimiser before deployment.

I’ve more than once had to fix bugs caused by code authors failing to realize the relationship “n/2 == n>>1” is only valid for natural numbers (non-negative integers).

[quote]I’ve more than once had to fix bugs caused by code authors failing to realize the relationship “n/2 == n>>1” is only valid for natural numbers (non-negative integers).
[/quote]
I made this error too…

Of course readability is good; I just meant that >>1 does not seem less readable than /2.

EDIT: but it seems that Java uses the right shift op (>>, not >>>), so it works even with negative numbers, or am I wrong?

I’m referring to the difference in rounding behaviour.

right shift rounds toward negative infinity.
division rounds toward 0.

e.g.

-3/2 = -1
-3>>1 = -2
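
For completeness, one common fix-up that makes a shift match a signed division by two is to add 1 to negative values before shifting (just a sketch):

// n / 2 rounds toward zero, n >> 1 rounds toward negative infinity.
static int halfTowardZero(int n) {
    return (n + (n >>> 31)) >> 1;   // (n >>> 31) is 1 for negative n, 0 otherwise
}
// halfTowardZero(-3) == -1 == -3 / 2, whereas -3 >> 1 == -2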

Ok, that’s definitely one way to prove that the bounds checks are actually having an effect! The irony here is that random access is the one area where Java should be able to really shine over non-GCed languages because it could be arranging things well in memory; if it’s just throwing all that away due to bounds checks, that’s really a shame.

One thing that I’ve noticed while browsing the source - HotSpot is able to prove certain facts about some of your variables to itself, for instance:


int[] myArray = new int[200];

if (x > 5) {
  if (x < 100) {
    //Hotspot knows that x is between 5 and 100 here, so in theory could skip the bounds check (I haven't verified that it does)
    myArray[x] += 1;
  }
}

The problem is, in a lot of random access cases the proof that you’re not going outside the range is a lot more complicated than some nested ifs, and I’m not sure how deeply things get analyzed and how far variable ranges propagate - not very far, I’m sure. Also, from what I can tell most of the JVM’s optimization is purely per-function (please correct me if I’m wrong on this!), so any range restrictions that might be present globally will not be factored in. To some extent this is unavoidable, since knowing whether we’re going outside the bounds of an array in general is equivalent to the halting problem, but I’m sure a lot more could be done here…

The more I think about this, the more I’m wondering whether what we really should be pushing for at this point is a more powerful optimization mode than -server. To me it seems that while in theory the idea of having a quick and dirty compile phase is great and all, the balance (even with -server) is quite a bit too far in favor of reducing pauses due to compilation at the expense of the ultimate speed of the program; furthermore, nobody really uses Java for any of the light, quick starting things that would have their overall performance impacted at all by a few stutters at first. Let’s be honest, nobody was ever happy with the client JVM, despite its supposed benefits over the server one. Is anyone actually irritated by (or does anyone even notice, for that matter) the time taken to compile? I know when I turn on the compilation print statements, I’m almost always annoyed at how quickly things stop happening, especially when I know there’s so much more that could be done…

I’m ‘happy’ with the client JVM on my machine (single-core P4 Northwood 2.4GHz), because the server VM has irregular pauses, due to compiles, spread over a long time span. I think this is because infrequently called methods reach the compilation threshold slowly.

I’m really looking forward to the tiered approach, hoping it will come without these pauses (which are not really acceptable for GUI/game apps).

[quote]-3/2 = -1
-3>>1 = -2
[/quote]
ok

[quote]Ok, that’s definitely one way to prove that the bounds checks are actually having an effect!
[/quote]
Yup, that’s really annoying; being able to manually disable them would be great, but not really Java-like :frowning:

There are already some improvements to HotSpot available (6up6) that work on 64-bit OSes.

http://java.sun.com/javase/technologies/performance.jsp

According to the Sun site:

“These performance enhancements will be included in an upcoming standard release of the Java SE platform.”

These improvements will probably include (http://www.ssw.uni-linz.ac.at/Research/Papers/Ko08/):

Scalar replacement: If an object does not escape the creating method, its fields can be replaced by scalar variables.

Stack allocation: An object that is accessed only by one method and its callees can be allocated on the stack instead of the heap.

Automatic Object Colocation and Inlining: Two objects, where one contains (by reference) the other, will be stored in memory as one object. This reduces the access costs of fields and arrays by improving cache behavior and by eliminating unnecessary field loads.

The same applies to objects containing an array whose size is known at creation time (String objects included):
http://www.ssw.uni-linz.ac.at/Research/Papers/Wimmer08/
http://www.ssw.uni-linz.ac.at/Research/Papers/Haeubl08Master/
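
A small made-up example of the pattern those optimizations target (whether a given VM actually applies scalar replacement or stack allocation here depends on its escape analysis; the class and method names are just for illustration):

// The Point created in distance() never escapes the method: with escape
// analysis the VM may replace it by two scalar locals (or allocate it on the
// stack) instead of heap-allocating a new object on every call.
final class Point {
    final double x, y;
    Point(double x, double y) { this.x = x; this.y = y; }
}

static double distance(double x1, double y1, double x2, double y2) {
    Point d = new Point(x2 - x1, y2 - y1);   // candidate for scalar replacement
    return Math.sqrt(d.x * d.x + d.y * d.y);
}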

Regards

Angelo