clamp() within POT

I was trying to speed up some bounds checks on data that had a power-of-two (POT) dimension.

So I needed to clamp integers between 0…63, 0…127, 0…255, 0…511, etc.

So clampPOT(val, 9) should give the same result as clamp(val, 0, 511):


   public static final int clampPOT(int val, int bits)
   {
      // clamp the low end: every negative value becomes 0
      // (val >> 31 is -1 for negatives, so the OR forces -1, and the +1 gives 0)
      val = ((val >> 31) | (val - 1)) + 1;

      // clamp the high end: every value >= POT becomes POT-1
      // (any bit set at position 'bits' or above saturates the result to the mask)
      return (((0 - (val & (-1 << bits))) >> 31) | val) & (~(-1 << bits));
   }


   public static final int clamp(int val, int min, int max)
   {
      return val < min ? min : (val > max ? max : val);
   }

I had this assumption that branching was kinda slow and fiddling with bits was ultra super fast. I was going to proudly present my bit fiddler on here… Turns out the non-branching code is slower on my Intel Core 2 Quad.

Benchmark:
clamp(val, 0, 511) ==> 19ms
clampPOT(val, 9) ==> 28ms

It’s not a benchmark flaw, as my code gets slower in the real algorithm too.
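
For context, the harness itself isn’t shown above, so here is a minimal sketch of the kind of loop that produces numbers like these. The array size, pass count and value range are my guesses rather than the original setup, and it assumes clamp() and clampPOT() are pasted into the same class:

public class ClampBench
{
   public static void main(String[] args)
   {
      int[] data = new int[1024 * 1024];
      java.util.Random rnd = new java.util.Random(12345);
      for (int i = 0; i < data.length; i++)
         data[i] = rnd.nextInt(2048) - 512; // negatives, in-range and too-big values

      int sink = 0; // accumulate results so the JIT can't discard the work

      long t0 = System.nanoTime();
      for (int pass = 0; pass < 64; pass++)
         for (int i = 0; i < data.length; i++)
            sink += clamp(data[i], 0, 511);
      long t1 = System.nanoTime();

      for (int pass = 0; pass < 64; pass++)
         for (int i = 0; i < data.length; i++)
            sink += clampPOT(data[i], 9);
      long t2 = System.nanoTime();

      System.out.println("clamp:    " + (t1 - t0) / 1000000 + "ms");
      System.out.println("clampPOT: " + (t2 - t1) / 1000000 + "ms");
      System.out.println(sink); // print the sink to defeat dead-code elimination
   }
}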

Let this be a lesson for… all of us :slight_smile:

I discovered years ago that this is true for integer abs implementations too: (x < 0 ? -x : x) was faster than (x ^ (x >> 31)) + (x >>> 31). I even tried using Jasmin to make various versions of the non-branching version and they were all slower than the branching one, even though I’m pretty sure my test set was about 50% negative, so it’s worst-case for branch prediction.
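
For reference, the two abs variants side by side, as a minimal sketch (the method names are mine, not from the post):

public static int absBranch(int x)
{
   // one compare plus a conditional negate
   return x < 0 ? -x : x;
}

public static int absBits(int x)
{
   // x >> 31 is -1 when x is negative and 0 otherwise; XOR with -1 flips
   // every bit (giving ~x), and adding the sign bit (x >>> 31, i.e. 0 or 1)
   // completes the two's-complement negation: ~x + 1 == -x
   return (x ^ (x >> 31)) + (x >>> 31);
}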

Makes you wonder why loop-unrolling yields so much performance gain…

Either that, or the JIT is screwing up. I should bench this code in C.

Branches aren’t quite as expensive as they once were, now that all these modernfangled CPUs have got branch prediction and parallel paths, where they actually start executing the code on both sides of a conditional branch before the condition is evaluated.

Cas :slight_smile:

You could reduce the number of bit operations (still slower, though). This is probably not the shortest, but as an example:


public static int clampBitHacko(int x, int max)
{
  int t0 = (max-x) >> 31;   // t0 = (x > max) ?  -1 : 0
  int t1 = (    x) >> 31;   // t1 = (x <   0) ?  -1 : 0
  int t3 = t0 & max;        // t3 = (x > max) ? max : 0
  int t4 = ~(t0 | t1);      // t4 = -1 if in range, otherwise 0
  
  return (x & t4) + t3;
}

Probably slower: trade an ‘if’ for a shift for the in-range values.


public static int clampPOTx(int val, int bits)
{
  int t = val >> bits;

  // t == 0 means 0 <= val < (1 << bits): already in range
  if (t == 0)
    return val;

  // t < 0 means val was negative; otherwise val >= POT, so clamp to POT-1
  return (t < 0) ? 0 : (1 << bits) - 1;
}

WRT abs: your code is using two shifts instead of one. This is faster than compares on my boxes:


    int t = x >> 31;  // t = -1 if x is negative, 0 otherwise
    x ^= t;           // if negative, flips all bits (x becomes ~x)
    x -= t;           // if negative, subtracting -1 adds the final +1 (~x + 1 == -x)
    return x;

The cost of branching keeps jumping around between different cores. Also, with small code in each block, the compiler may be able to convert the branch into conditional moves and/or masking.

What CPU is this on?

If you’re asking me about abs, then the last P4 stepping and a Core 2 Quad. The difference is greater on the Core 2.

I have some vague memory of an integer barrel shifter not being available on some modern CPU (I think it was Intel); that’s why I ask.

So an i<<15 would be 15 times slower than an i<<1.

While I’m not entirely sure how branch prediction works in modern CPUs, presumably there is a finite number of branches that can be predicted simultaneously?

So, while a micro-benchmark comparing a branching abs() against a non-branching abs() might indicate the branching version is faster, if you introduce it into a more complex app where a high density of branching code already exists, your results may vary.

Barrel shifting was removed on P4 and beyond.

Cas :slight_smile:

@Markus_Persson: fixed- and variable-length shifts have the same cost on Intel, regardless of the number of positions shifted.

@Abuse: I’m saying that the non-branching version of abs is faster on both NetBurst and Core architectures. Didn’t you mean the non-branching clamp? Another big issue for clamp is the increased number of operations, which form a mostly serial dependency chain and thus prevent multiple instructions from issuing per cycle… so your point is valid: a non-branching version might be superior when integrated into another piece of code.

As for branch predictors, this stuff varies within a given family. I think Core has 16 and they are only ‘consumed’ on a taken branch, but this is from memory and most likely wrong.

Ain’t this exactly the sort of optimisation that JVMs are supposed to make based on architecture, rather than us plebeian programmers?

Cas :slight_smile:

I think replacing branches with bit hackery or vice versa is probably a bit much to expect from any optimiser.

Converting simple branches to bit hacks isn’t too bad. It’s a simple pattern match, like converting constant multiplies to addition chains (and other techniques), reducing chains of bit ops, etc. The real trick is scheduling for a constantly moving target (à la architecture changes). I’m hoping that eventually one of the LLVM-based projects will make enough progress to be usable.
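
As a concrete instance of that kind of pattern match (my example, not from the post), a constant multiply rewrites into shifts and adds:

x * 10 == (x << 3) + (x << 1)

The branch-to-bits conversions above are the same flavour of local rewrite.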

Couldn’t you use synchronized?

:stuck_out_tongue:
Anyway, I actually haven’t bothered to try to write a very fast clamp method; I typically just do this:


myValue = whatever;
myValue = Math.min(myValue, max);
myValue = Math.max(myValue, min);

No doubt that’s by far the slowest way you can do it: three assignments, two method calls, and who knows what’s going on inside Math. Oh well, it’s never been my bottleneck, so I haven’t bothered.
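
For what it’s worth, those three lines collapse into a single expression (just a reshuffle of the above, nothing clever):

myValue = Math.max(min, Math.min(whatever, max));

And as far as I know, HotSpot treats Math.min/Math.max on ints as intrinsics anyway, so the method-call overhead mostly evaporates once the JIT warms up.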

and IMO, you just proved it again:

There are 2 different cases for clamp():

v < min  => 1 conditional
v >= min => 2 conditionals

so it performs 1 to 2 ops (conditional branches)

There is only 1 case for clampPOT(), and it performs:

>> : 2 times
<< : 2 times
-  : 4 times
&  : 2 times
+  : 1 time
=  : 1 time
|  : 2 times
~  : 1 time

finally, that’s 15 ops for any input value for clampPOT(), against 1 or 2 (let’s say 1.5) ops for clamp() (also, subtraction and addition are usually a little slower than other ops)

this gives a ratio of 15/1.5 in op count, and according to your sample (28 ms / 19 ms), roughly speaking, a conditional branch is about 5 times slower than other ops; that’s why loop unrolling always gives a huge improvement
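
Spelling that arithmetic out: if a conditional branch costs k ordinary ops, the timings predict (1.5 × k) / 15 = 19 / 28, which solves to k = (15 × 19) / (1.5 × 28) ≈ 6.8, so “about 5 times slower” is the right ballpark, if a little low.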

about zero clamp, why not:
val *= 1 ^ (val >> 31);

You’re right, but I expected the difference to be much bigger, and on older CPUs, it was.

There’s a typo or a C-ism there:

val *= 1 ^ (val >>> 31); // val >>> 31 is 1 for negatives, so the multiplier becomes 0

or trade the ‘mul’ for an ‘and’:

val &= ~(val >> 31); // the mask is 0 for negatives, all ones otherwise

oops… this is a C-ism…

[quote]val &= ~(val >> 31);
[/quote]
nice one, much faster than a mul
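
To close the loop, here is a quick brute-force check (mine, not from the thread) that the variants agree; it assumes clamp() and clampPOT() from the first post are pasted into the same class:

public class ClampCheck
{
   public static void main(String[] args)
   {
      int bits = 9;
      int max = (1 << bits) - 1; // 511

      for (int v = -1000000; v <= 1000000; v++)
      {
         if (clamp(v, 0, max) != clampPOT(v, bits))
            throw new AssertionError("clamp mismatch at " + v);

         // the corrected zero-clamp one-liners should match Math.max(v, 0)
         int viaMul = v * (1 ^ (v >>> 31));
         int viaAnd = v & ~(v >> 31);
         if (viaMul != Math.max(v, 0) || viaAnd != Math.max(v, 0))
            throw new AssertionError("zero-clamp mismatch at " + v);
      }
      System.out.println("all variants agree");
   }
}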