Finding index of single 1 bit in a long

Hi everybody. I’m working on a chess program which makes frequent calls to a method that finds the position of the first (and only) set bit in a long value. So 1->0, 2->1, 4->2, 8->3, 16->4, …

The straightforward code is this:

public static int bitIndex1(long x) {
    int r = 0;
    while ((x & (1L << r)) == 0)
        r++;
    return r;
}

The enhanced version takes less time in the average case:

public static int bitIndex2(long x) {
    int r = 0;
    if ((x & 0xFFFFFFFF00000000L) != 0) r += 32;
    if ((x & 0xFFFF0000FFFF0000L) != 0) r += 16;
    if ((x & 0xFF00FF00FF00FF00L) != 0) r += 8;
    if ((x & 0xF0F0F0F0F0F0F0F0L) != 0) r += 4;
    if ((x & 0xCCCCCCCCCCCCCCCCL) != 0) r += 2;
    if ((x & 0xAAAAAAAAAAAAAAAAL) != 0) r++;
    return r;
}

Is it possible to do this even faster? On x86 there is a single assembly instruction which does this kind of stuff :frowning:
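[Editor's aside: on JDKs from Java 5 onward (later than the VMs discussed in this thread), the standard library exposes exactly this operation, and HotSpot compiles it down to the BSF/TZCNT instruction mentioned above. A minimal sketch; the class name is my own:]

```java
// Java 5+ only: Long.numberOfTrailingZeros is intrinsified to BSF/TZCNT
// on x86 by HotSpot, so it is about as fast as this operation can get.
public class TrailingZeros {
    public static int bitIndex(long x) {
        return Long.numberOfTrailingZeros(x); // returns 64 for x == 0
    }

    public static void main(String[] args) {
        System.out.println(bitIndex(16L)); // prints 4
    }
}
```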

Arrh yes the good old BSF opcode.

What am I saying!?! It used to be one of the slowest instructions on the 486, taking up to 43 clock cycles; people used to do everything to avoid using it, including using the FPU stack!!

[cold shiver runs down spine…]

Luckily the 486 days are gone for good (unless you work for NASA).

Native One, you seem to be using 64-bit longs; it would be better to break them into two 32-bit integers.

But do you need to be using longs in the first place?

The -server version may handle these longs better as it is SSE-enabled, although I have no idea if it supports 64-bit longs (a job for MMX, not SSE).

You also do not have an immediate escape clause: if x is zero you still run through all six of those 64-bit checks before returning zero.
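[Editor's aside: a minimal sketch of such an escape clause bolted onto bitIndex2. The name bitIndex3 and the -1 sentinel for "no bit set" are my own choices, not from the thread:]

```java
public class BitIndex3 {
    public static int bitIndex3(long x) {
        if (x == 0) return -1; // early escape: no bit set at all
        int r = 0;
        if ((x & 0xFFFFFFFF00000000L) != 0) r += 32;
        if ((x & 0xFFFF0000FFFF0000L) != 0) r += 16;
        if ((x & 0xFF00FF00FF00FF00L) != 0) r += 8;
        if ((x & 0xF0F0F0F0F0F0F0F0L) != 0) r += 4;
        if ((x & 0xCCCCCCCCCCCCCCCCL) != 0) r += 2;
        if ((x & 0xAAAAAAAAAAAAAAAAL) != 0) r++;
        return r;
    }
}
```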

On the server version you also have the ability to set the size of code you want to inline; this could seriously help you if bitIndex is being called regularly.

Hope that helps a bit.

Woz.

Native One,

After some testing I reckon this version is faster:


public static int bitIndex4(long x) {
    final int a = (int) x;
    final int b = (int) (x >> 32);
    int r = 0;

    if (a == 0) {
        if ((b & 0xFFFF0000) != 0) r += 16;
        if ((b & 0xFF00FF00) != 0) r += 8;
        if ((b & 0xF0F0F0F0) != 0) r += 4;
        if ((b & 0xCCCCCCCC) != 0) r += 2;
        if ((b & 0xAAAAAAAA) != 0) r++;
        r += 32;
    } else {
        if ((a & 0xFFFF0000) != 0) r += 16;
        if ((a & 0xFF00FF00) != 0) r += 8;
        if ((a & 0xF0F0F0F0) != 0) r += 4;
        if ((a & 0xCCCCCCCC) != 0) r += 2;
        if ((a & 0xAAAAAAAA) != 0) r++;
    }
    return r;
}

I’m not happy about the casting at the top, but I assume you have to use longs. Also, there is a bit of a hole: your style of code allows a value of 0 to be returned as if bit 0 had been set.

This has fallen through into the int version thus:
bitIndex2( 0L ) returns 0
bitIndex2( 1L ) returns 0
bitIndex4( 0L ) returns 32
bitIndex4( 1L ) returns 0

Which does not affect the way you want it to work but is something to at least be aware of.

Java 1.3.1
On a 1GHz PIII Coppermine with approx. CPU clock speed = 997453224 cycles per second.

I was getting a consistent average (on 10000 x 64 tests) of
1154 clock cycles per call on bitIndex2()
1050 clock cycles per call on bitIndex4()

These values include the loop overhead per iteration, but profiling down to this level is fairly hard and should be considered a micro-benchmark: on the first run it is possible to see how much CPU HotSpot takes to optimize the execution path when, in bitIndex4(), bit 32 is set, causing it to take the other branch.

Profiling the actual methods results in bitIndex4() being ~20cc faster than bitIndex2() when done in a smaller bits-1-to-64 set test, which is a different value from the 10000 x 64 tests.

The Microsoft JVM also mirrored this speed increase, although it is much faster; values of 515cc for bitIndex4() are not unheard of (again, it jumps around a lot).

Although I’m not happy with the profiling environment, as I just whacked the code into an existing applet, so there are plenty of other threads running.

So perhaps tomorrow I’ll try it in a clean testbed and see if the values are any more consistent.


Woz.

Guess who managed to post the wrong piece of code last night!
Ooops, sorry about that, this is the one you want:


public static int bitIndex5(long x) {
    int r = 0;
    int a = (int) x;

    if (a == 0) {
        a = (int) (x >> 32);
        r += 32;
    }

    if ((a & 0xFFFF0000) != 0) r += 16;
    if ((a & 0xFF00FF00) != 0) r += 8;
    if ((a & 0xF0F0F0F0) != 0) r += 4;
    if ((a & 0xCCCCCCCC) != 0) r += 2;
    if ((a & 0xAAAAAAAA) != 0) r++;

    return r;
}

Caused by looking at the wrong spreadsheet!! Anyway, this one is faster and smaller than bitIndex4(), as it avoids doing two casts in 50% of cases and the execution path is a little cleaner.

Sorry about that.

Woz.

I’m wondering how 1.4.2 would perform compared to the (old) VMs you tested on…

Erik

For those interested in dots.

Apologies for the large post; you probably want to go full screen to get the table data lined up (sorry).

The clock cycles given include the loop overhead; for each VM both the average and the median cycle counts are given.
CPU = 1GHz PIII Coppermine, approx. clock speed = 997453224 cycles per second.

Clock cycles for bitIndex2() (average / median; thread priority = 5):

IBM Classic VM 1.3.0                 335.163 / 335   (ibm13index2e.csv)
Microsoft VM 1.1.4                   462.998 / 463   (ms114index2e.csv)
Sun Classic VM 1.2                   334.378 / 334   (sun12index2e.csv)
Sun HotSpot Client VM 1.3.1-b24      612.59  / 612   (sun131index2e.csv)
Sun HotSpot Client VM 1.4.1-b21      626.142 / 626   (sun141index2e.csv)
Sun HotSpot Client VM 1.4.2_01-b06   635.11  / 635   (sun142index2e.csv)

Clock cycles for bitIndex4() (average / median; thread priority = 5):

IBM Classic VM 1.3.0                 216.247 / 216   (ibm13index4e.csv)
Microsoft VM 1.1.4                   354.651 / 354   (ms114index4e.csv)
Sun Classic VM 1.2                   308.959 / 309   (sun12index4e.csv)
Sun HotSpot Client VM 1.3.1-b24      501.409 / 501   (sun131index4e.csv)
Sun HotSpot Client VM 1.4.1-b21      487.529 / 487   (sun141index4e.csv)
Sun HotSpot Client VM 1.4.2_01-b06   470.736 / 471   (sun142index4e.csv)

Clock cycles for bitIndex5() (average / median; thread priority = 5):

IBM Classic VM 1.3.0                 208.485 / 208   (ibm13index5e.csv)
Microsoft VM 1.1.4                   342.374 / 342   (ms114index5e.csv)
Sun Classic VM 1.2                   278.477 / 278   (sun12index5e.csv)
Sun HotSpot Client VM 1.3.1-b24      466.484 / 466   (sun131index5e.csv)
Sun HotSpot Client VM 1.4.1-b21      460.837 / 461   (sun141index5e.csv)
Sun HotSpot Client VM 1.4.2_01-b06   450.049 / 450   (sun142index5e.csv)
Speed gain when using bitIndex5() instead of bitIndex2():

IBM 1.3.0       161%
MS 1.1.4        135%
Sun 1.2         120%
Sun 1.3.1       131%
Sun 1.4.1       135%
Sun 1.4.2_01    141%

As you can clearly see, it is worth using bitIndex5() over bitIndex2(), giving at worst a 20% speed increase (at best a 61% increase).
If you want to know which JVM was faster on my machine, well that goes (unsurprisingly) to IBM’s 1.3.

Unfortunately I didn’t get to include Kaffe, due to the version I have being broken on Windows. And I didn’t succeed in compiling a .exe via GNU’s GCJ, due to a JNI link error which is most likely my fault (but I haven’t finished with this yet).

Woz.

Thanks a lot for your help. So basically bitIndex2() seems to be the right thing to do - just not in the fastest variant. Originally I hoped that maybe someone would come up with a better algorithm, maybe exploiting some obscure bit fumbling.
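[Editor's aside: one classic piece of exactly this kind of obscure bit fumbling - branch-free, a single multiply plus a 64-entry table lookup - is the de Bruijn method from Leiserson, Prokop and Randall's paper "Using de Bruijn Sequences to Index a 1 in a Computer Word". A hedged sketch; the class name is my own, and the table is built at class-init time rather than hard-coded so a transcription typo cannot creep in:]

```java
public class DeBruijnIndex {
    // 64-bit de Bruijn constant from the Leiserson/Prokop/Randall paper
    private static final long DEBRUIJN64 = 0x022fdd63cc95386dL;
    private static final int[] INDEX = new int[64];

    static {
        // Each left shift of the sequence exposes a distinct 6-bit window
        // at the top, so that window uniquely identifies the shift amount.
        // Building the table here keeps it consistent with the lookup
        // below by construction.
        for (int i = 0; i < 64; i++)
            INDEX[(int) ((DEBRUIJN64 << i) >>> 58)] = i;
    }

    /** x must have exactly one bit set, i.e. x == 1L << i for some i. */
    public static int bitIndex(long x) {
        // x * DEBRUIJN64 == DEBRUIJN64 << i, so the top 6 bits recover i
        return INDEX[(int) ((x * DEBRUIJN64) >>> 58)];
    }
}
```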

Because my work is in the early stages I’ll consider switching to C/C++ and later optimizing with assembly. Could be quite fun and challenging - maybe I’ll toast my CPU :slight_smile: I wonder how a C++ and then an asm version would compare in terms of cycles.

Can it be true that Sun 1.2 is faster than the successor VMs? Strange…

Merry Christmas & Happy new Year
NativeOne

One area for improvement is the use of branches (ifs).
You can do the same with some cunning bit-fumbling :wink:
This is also the best way to approach an asm version as you can pipeline the logical ops much better than using branches.

Note also that the bit patterns were swapped for the tests, as this trick wouldn’t work if the top bit was set.


public static int bitIndex6(long x) {
    int r = 0;
    int a = (int) x;
    int b = (int) (x >> 32);

    // if (a == 0) {
    //     a = (int) (x >> 32);
    //     r += 32;
    // }
    int c = (a >> 31) | ((-a) >> 31);  // 0 if a is 0, -1 otherwise
    a = (a & c) | (b & ~c);            // a = b if a was 0
    r = 32 & ~c;

    // if ((a & 0xFFFF0000) != 0) r += 16;
    b = a & 0x0000ffff;
    b = ~((-b) >> 31);  // 0 if b was non-zero
    r |= 16 & b;

    // if ((a & 0xFF00FF00) != 0) r += 8;
    b = a & 0x00ff00ff;
    b = ~((-b) >> 31);  // 0 if b was non-zero
    r |= 8 & b;

    // if ((a & 0xF0F0F0F0) != 0) r += 4;
    b = a & 0x0f0f0f0f;
    b = ~((-b) >> 31);  // 0 if b was non-zero
    r |= 4 & b;

    // if ((a & 0xCCCCCCCC) != 0) r += 2;
    b = a & 0x33333333;
    b = ~((-b) >> 31);  // 0 if b was non-zero
    r |= 2 & b;

    // if ((a & 0xAAAAAAAA) != 0) r++;
    b = a & 0x55555555;
    b = ~((-b) >> 31);  // 0 if b was non-zero
    r |= 1 & b;

    return r;
}


Don’t know if this would be faster in Java, but writing the asm or C++ like this should be faster overall (no branch mispredictions can occur).

Also, this is straight out of my head so apologies for any typos / errors :slight_smile:
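[Editor's aside: since there are only 64 possible single-bit inputs, an exhaustive check against the simple loop version from the top of the thread is essentially free and catches exactly this kind of typo. A hedged sketch; the harness names are my own invention:]

```java
import java.util.function.LongToIntFunction;

// Hypothetical harness: validates any single-bit index routine against
// the straightforward loop version over all 64 possible inputs.
public class BitIndexCheck {
    // the simple loop from the top of the thread serves as the reference
    static int bitIndexLoop(long x) {
        int r = 0;
        while ((x & (1L << r)) == 0) r++;
        return r;
    }

    // throws if f disagrees with the reference on any single-bit input
    public static void check(LongToIntFunction f) {
        for (int i = 0; i < 64; i++) {
            long x = 1L << i;
            if (f.applyAsInt(x) != i)
                throw new AssertionError("wrong index for bit " + i);
        }
    }

    public static void main(String[] args) {
        check(BitIndexCheck::bitIndexLoop); // the reference trivially passes
        System.out.println("all 64 single-bit inputs ok");
    }
}
```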

  • Dom

[quote]If you want to know which JVM was faster on my machine, well that goes (unsurprisingly) to IBM’s 1.3.
[/quote]
Yeah, it’s a shame it seems they don’t develop VMs for Windows anymore :-[

Erik

Erik,

IBM still makes a JVM; it’s just part of WebSphere now.
http://www-106.ibm.com/developerworks/java/jdk/
IBM typically runs a bit behind in version numbers, though; they were at version 1.1.8 for ages.

But what I don’t understand is why they have a version 1.4.1 for Linux but not Windows; the Linux crowd have had it for ages too.

I completely agree with Crystal Squid: conditional branches kill modern CPUs (especially the P4 with its super-long pipeline). It would be interesting to try to make as much of that code as possible run in parallel (superscalar architecture). But do you know what! The BSF opcode is supposed to run in 1-2 clocks on the PPro, PII and above!!!

Native One, I would not necessarily say that Sun 1.2 is faster than the 1.4 series, as this is only one test; the rest of the JVM may very well suck! It may not scale at all, have bad array access and a poor I/O subsystem.

One thing you can note is that there is a difference between JITs and HotSpots: the JIT compilers like your code; HotSpots, for whatever reason, do not. IBM’s JVM is a HotSpot-style compiler, but they seem to have managed to keep all the benefits of a JIT too. Unfortunately all these JVMs are closed source, so it is really difficult to work out the actual reason for it. Also, this is only on MY machine; it is entirely possible for it to be the other way around on your PC, although typically all PII & PIII systems would behave the same in this test because the memory subsystem is hardly used.

But then of course this is not a “live” environment: the code does not actually do anything, and it is not used the way you intend it to be used. The code is so small that the entire program should fit in the L1 cache of the CPU, something that probably won’t happen when your chess program is finished.

As to which programming language you should use, I see no reason why you should not use Java; it’s (relatively) easy to use, and changes can be made quickly. OK, speed of execution “could” be a problem, but if this is your first go at a chess program it’s not really going to be an issue - you’re hardly going to be running the thing against Deep Thought, are you!!??

Anyway, hope you found this all useful - if not, at least interesting!

All the best,

Woz.

No one’s mentioned the usual optimization mantra:

“Premature optimization is the root of all evil” - Donald Knuth

It’s usually better not to worry about these kinds of things right now, and to detect any slowdowns once you start profiling. It may be that in the real environment Java will perform a lot better than your micro-benchmark predicts, or that there are much bigger bottlenecks elsewhere.

Unless you’re sure this is going to be a big deal - exactly how frequently is this method going to be called?

[quote]Erik,

IBM still do make a JVM, its just part of Web Sphere now.
http://www-106.ibm.com/developerworks/java/jdk/
IBM typically runs a bit behind in the version numbers though, they were at version 1.1.8 for ages.

But what I don’t understand is why they have a version of 1.4.1 for Linux but not windows, the Linux crowd have had it for ages too.
[/quote]
Yes, you’re right. It’s just that I’ve given up hope after waiting so long for 1.4 for Windows. I’m very impatient and I can’t use 1.3.1 for my games :slight_smile:
OTOH, I have no real problems with Sun’s offering, as it’s stable and usually fast enough for what I do. I actually think Sun is doing a great job in general.
But I do sometimes question Sun’s statements that their JVM can reach C++ speeds with well-written code when, in all cases I tried, IBM’s 1.3.1 JVM outperformed Sun’s JVMs by a factor of 2 and up - and I’m not talking about microbenchmarks or just my own crappy code… Hmmm… :-/

Erik

[quote]But I do sometimes question Sun’s statements that their JVM can reach C++ speeds with well written code if in all cases I tried,
[/quote]
The magic words. I’ve said it before and I’ll say it again: microbenchmarking is hard, and interpreting the results is even harder.

I have not seen such a difference in recent benchmarks. In fact, I believe that in SpecJVM, which was the standard CPU benchmark for Java for quite some time, we and IBM basically played ping-pong over who was on top for a long time.

In general, the differences between our VM and IBM’s have had a lot to do with what you were measuring. A few versions back, the big differences you saw were that if you measured array access they won, because we hadn’t implemented bounds-check hoisting yet, and if you measured memory allocation/deallocation we won, because they didn’t have generational allocation and collection. A good example of how these trade-offs play out can be seen in Chris Rijk’s excellent (if now old) article “Binaries vs. Bytecodes.”

Since then (starting with 1.4) we’ve got the bounds-check hoisting. I don’t know what they’ve done or not done about their allocator/collector.

[quote]For those interested in dots.
[/quote]
(1) Are you comparing apples vs. apples (IBM server VM vs. Sun server VM)?
(2) Are you running the Sun VM with -Xcompile and properly warming the VM up? If not, the results could simply be due to IBM deciding to kick into compiled code sooner.

In general, AFAIK a properly run SpecJVM does not show such a difference between IBM’s VM and ours, which makes me suspect your methodology.

Okay Woz, you are DEFINITELY running the VM wrong.

I see Classic (which doesn’t even exist anymore) and Client, but NO server VM numbers.

IBM, OTOH, I believe only has a server VM.

Try apples v. apples to start :slight_smile:

Wow Jeff calm down!

The comparison was which routine was quicker; for me it’s bitIndex5().

This was the IBM runtime JVM I downloaded and use. IBM has either moved or removed the page since, but Google still caches it.

Google Cache

This has a RunTime installable package, ibm-cpt-rn20.exe.

Planet mirror still has it:
http://public.planetmirror.com/pub/java-ibm/1.3.0/win32/

I did not use the -Xcompile switch with Sun’s 1.4.x (or any of the other JVMs), although I allowed time for the JVMs to warm up.

Shall have a go and see if it makes any difference when I’ve got some time free.

Woz.

To what extent is this because Sun creates the new versions of Java, and IBM has to wait and see what version X+1 is and does, and what has changed, before they can write their version? (I assume these days major Java outfits like IBM get very early access to planned architectural and library change information?)

My very limited understanding is that this used to be a major reason why IBM had a long time lag in keeping up with the Sun releases, along with the fact that they devoted a heck of a lot more resources to Java development than Sun did (this is going back some years, but it was certainly true for a long time; IBM touted the fact that they had 5 Java devs for every 1 Sun Java dev (or whatever the ratio was), probably for longer than was strictly fair, but…). But then IBM has 300k employees :), so perhaps it wasn’t hard.

I’ve kind of assumed that IBM has been the driving force behind making Sun make Java faster and better - you don’t want to be eclipsed (pun unintended) in your own language by a competitor - and this is a good thing: IBM spends longer on their JVM in an attempt (amongst other things) to get one over on Sun, which pushes Sun to devote more resources and push a bit harder at making the next iteration even better. Both companies are extremely good at VM development - IBM historically was significantly better, but many of their experts moved to Sun (and vice versa) over the years, so nowadays it’s presumably fairly equal. With the continued demise of Compaq (and all they swallowed), is there anyone left in the world with VM knowledge and resources that even comes close to Sun/IBM?

Compare this to the MS monopolies, where MS rarely has any reason to improve e.g. Word, and the software just stagnates, IMHO. I just want to see IBM and Sun keep pushing each other, not resting on their laurels, and hopefully never running out of major new features etc. to beat each other on (or the same stagnation may set in).

I don’t know if Intel still develops it in-house, but they had quite a reasonable team working on the ORP JVM. The effort was mainly driven by the need to have a state-of-the-art JVM for the IA64 architecture, which is non-trivial to say the least as far as compilers are concerned, and that translates directly to the JIT implementation. At the same time, they open-sourced the x86 implementation.

It is dead from what can be seen outside - no way of guessing whether they have just focused on IA64 and stopped pushing patches to the open-source version, or given up altogether (IA64 is not exactly a best-selling processor, after all…). Anyway, the team is somewhere in there, inside an Intel research lab, waiting to take over the world when the hour comes.

None.

IBM has its own version of Java, its own source base; they don’t have to wait for ours. They are equal members of the JCP and know when a JSR is final as soon as we do.

By the way, I believe they are still full licensees which means that as soon as we finish OUR VM, they get to see the source.

The reverse is not true.

Actually EVERYONE gets the info at more or less the same time, when the JSR is approved by the JCP J2SE Executive Committee. IBM and anyone else who wants to join the expert group for a JSR has a slight lead, in that they are involved in the debates that lead to the version sent to the EC.

How could devoting more resources be a reason for delays? Maybe I’m not following you, as that doesn’t seem very sensible to me.

I don’t know much about the internals of IBM’s Java development, so I can’t comment beyond what I know. I do know that they have their development split in half (or did the last time I talked to an IBM person, which was a few years ago): they have a core VM group that develops the base of their technology, and then porting groups that port it to all the platforms they support (which is a lot - they have many old legacy systems like the 360s and VM (that’s an IBM operating system) machines, and are likely still supporting AIX users, etc.).

It’s a deceiving number, as they are supporting many more platforms than we are and structure their development totally differently. You really can’t just count heads.

You are half right. We have continually driven each other’s development forward. It’s a bilateral sort of thing.

and yes, competition is a very good thing :slight_smile:

Imagine how much better (or at least more stable) Windows would be if MSFT had competitors…

[quote]Wow Jeff calm down!
[/quote]
Sorry, but I know “IBM is always twice as fast” isn’t true, and those are fighting words :slight_smile: Remember, I was ON the JDK 1.3 tuning team and we did heavy measurements of ourselves AND competitors.

IBM’s VM (which is a server-style VM) is always twice as fast as Sun’s CLIENT VM, I believe - which is what it looks like you measured.

Make sure to run with -Xcompile and -server and you might get a fairer comparison. Of course, microbenchmarks are horribly prone to inaccuracies anyway, in that they don’t reflect the way real code executes, which is what we at Sun tune for.