Finding index of single 1 bit in a long

[quote]Sorry but “IBM is always twice as fast” I know isn’t true and is fighting words :slight_smile:
[/quote]
Maybe you’re referring to my remark, and in that case
I’d like to re-emphasize that I didn’t say that IBM is always
twice as fast. I said that in the cases I tried, IBM was
always faster. I know those cases don’t support a general conclusion about IBM vs. Sun; at most they are the cases of interest to me. And I was testing IBM’s 1.3 against Sun’s 1.3 (apples to apples and all :)).
I haven’t really tried IBM 1.3 against recent Sun VMs much, since I do things with Sun’s VMs that are simply not possible with 1.3.

Just to make myself a bit more clear…

Erik

[quote]I said that in the cases I tried, IBM was
always faster. I know those cases don’t support a general conclusion about IBM vs. Sun; at most they are the cases of interest to me. And I was testing IBM’s 1.3 against Sun’s 1.3 (apples to apples and all :)).
[/quote]
Ah. IBM had bounds-check hoisting in their 1.3. We didn’t have it until our 1.4. That’s likely the difference you were seeing in most of your tests. (Not actually seeing the tests, I can’t tell for sure.)

OTOH, since our 1.4 was out not too much later than their 1.3, it’s arguable whether 1.3-to-1.3 or 1.3-to-1.4 is the correct comparison.

As a general rule I always try to compare the most current versions of whatever VMs I’m comparing, regardless of their version numbers.

[quote]Ah. IBM had bounds-check hoisting in their 1.3. We didn’t have it until our 1.4.
[/quote]
Well, in the server VM, that is (or has it been implemented in the client lately?). For games, people usually use the client VM.
IBM’s VM might be a server VM, but it ‘feels’ like a client VM too (warm-up times).

[quote]That’s likely the difference you were seeing in most of your tests. (Not actually seeing the tests, I can’t tell for sure.)
[/quote]
Likely. At the time I was much into emulation, which has array accesses everywhere.

Erik

(sorry for hijacking this thread, btw)

Not really true at all. Go ask Shawn Kendall :slight_smile:

Can’t really comment, as that’s a subjective measure. You can
make HotSpot “warm up” right away easily enough by saying -Xcompile.

[quote]Not really true at all. Go ask Shawn Kendall
[/quote]
Hmmm… Well, it’s troublesome enough to get people to download the JRE, let alone the whole SDK to get the server VM…
You surely can’t be serious that most people have the SDK already?

Yes it is.

Yes, I know, but it would be some hassle to get, for example, Web Start to use the server VM, and use -Xcompile.
I vote for merging the client and server VMs :slight_smile:

[quote]for example Web Start to use the server VM, and use -Xcompile.
I vote for merging the client and server VMs :slight_smile:
[/quote]
Well, IMO you will always want different configuration parameters for client and server operation, even if in the end they are running on the same codebase.

The real problem here is that you say you cannot pass VM flags to Web Start. That’s pretty intolerable if true and needs to be fixed.

Speaking of which, it would be good to start a thread that’s JUST a list of open Web Start issues. I think I’ll ask Chris for a Web Start topic of its own.

Well, “safe” VM args (I guess that’s rather vague ;D) are supposedly supported in Tiger:
http://developer.java.sun.com/developer/bugParade/bugs/4790673.html

Yes, please do! In other discussion groups / forums / mailing lists / sites I come across quite a few Java games devs who are struggling with or debating using JWS.

It would be really helpful if I could point them at such a forum whenever trouble came up, and so help route more people to the same place.

I tell them to try JGO anyway, but currently you have to put in quite a lot of effort to find out about even a single small thing like JWS from within JGO.

Well, I’ve started a JWS RFEs thread in “On-line Games”, but I agree JWS really needs its own topic. I’m bugging Chris about this now…

I think you can do this without any branching… theoretically it could be faster.

If you can rewrite the ifs using a mask that effectively adds zero when the condition is false, it might keep the processor pipeline full and be marginally faster.

You may be able to exploit properties of the number. E.g. a binary number with a single bit set, minus 1, gives an interesting value that makes a good mask for ANDing: e.g. 10000 - 1 = 01111.

I came up with this…


      public static int getIndex(int v)
      {
            int m1 = v-1;                         // all ones below the single set bit
            int m2 = ((v & 0xAAAAAAAA)-1+v);      // equals m1 unless the bit is at an odd position
            int m3 = (m1-m2)>>31;                 // all ones if the mask matched, else 0
            int i = m3 & 1;

            m2 = ((v & 0xCCCCCCCC)-1+v);          // same test for the 2s bit of the index
            m3 = (m1-m2)>>31;
            i += m3 & 2;

            m2 = ((v & 0xF0F0F0F0)-1+v);          // 4s bit
            m3 = (m1-m2)>>31;
            i += m3 & 4;

            m2 = ((v & 0xFF00FF00)-1+v);          // 8s bit
            m3 = (m1-m2)>>31;
            i += m3 & 8;

            m2 = ((v & 0xFFFF0000)-1+v);          // 16s bit
            m3 = (m1-m2)>>31;
            i += m3 & 16;
            return i;
      }

I haven’t investigated if this can be further simplified.
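As a quick sanity check, the version above can be verified exhaustively against the JDK; here is a self-contained sketch (the class name and harness are mine, not from the thread):

```java
public class GetIndexCheck {
    // Copy of the branchless getIndex() above, so this compiles on its own.
    public static int getIndex(int v) {
        int m1 = v - 1;
        int m2 = ((v & 0xAAAAAAAA) - 1 + v);
        int m3 = (m1 - m2) >> 31;
        int i = m3 & 1;
        m2 = ((v & 0xCCCCCCCC) - 1 + v);
        m3 = (m1 - m2) >> 31;
        i += m3 & 2;
        m2 = ((v & 0xF0F0F0F0) - 1 + v);
        m3 = (m1 - m2) >> 31;
        i += m3 & 4;
        m2 = ((v & 0xFF00FF00) - 1 + v);
        m3 = (m1 - m2) >> 31;
        i += m3 & 8;
        m2 = ((v & 0xFFFF0000) - 1 + v);
        m3 = (m1 - m2) >> 31;
        i += m3 & 16;
        return i;
    }

    public static void main(String[] args) {
        // The method is only defined for inputs with exactly one bit set.
        for (int k = 0; k < 32; k++) {
            int v = 1 << k;
            if (getIndex(v) != k)
                throw new AssertionError("failed at bit " + k);
        }
        System.out.println("all 32 single-bit inputs OK");
    }
}
```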

[quote]I think you can do this without any branching… theoretically it could be faster.
[/quote]
That’s what I said, back before this thread was so successfully jacked :stuck_out_tongue:

I had it down to (for just one test block):

[quote] b = a&0x0000ffff;
b = ~((-b)>>31); // 0 if b was non-zero
r |= 16 & b;
[/quote]
which is 6 logical ops per bit of the output:
AND, NEGATE, SHIFT, INVERT, AND, OR
and you have:

[quote] m2 = ((v & 0xFFFF0000)-1+v);
m3 = (m1-m2)>>31;
i += m3 & 16;
[/quote]
which I make 7 ops:
AND, DEC, ADD, SUB, SHIFT, AND, ADD

Although I’m not entirely sure how your version works; I love to see bit tricks I don’t understand. Gives me a puzzle to solve :wink:

  • Dom

Ah… sorry, I must have skimmed through the posts too fast… you’ve beaten me by one op… our last lines are equivalent, as I could use OR instead… however, looking at your code gave me an idea to simplify mine…


int m1 = ((x & 0x0000FFFF) - 1) >> 31;  // all ones if none of the low 16 bits was set
i |= 16 & m1;

I think that should work. AND, SUB, SHIFT, AND, OR… down to 5 ops.

The AND and the -1 basically create a negative number if none of the bits we are testing was set, and they also ensure that if the high bit was the one that was set, it will not be set any more. The shift then creates a mask, relying on the high bit being sign-extended by the arithmetic shift. After that it’s a simple OR of the bit into the result…

edit: fixed the mask; it needs to be the same as yours now, otherwise the result would have to be subtracted from 31 :slight_smile:
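To make the trick concrete, here is a small worked example of the 16s step (input values and names chosen for illustration, not from the thread):

```java
public class MaskStepDemo {
    public static void main(String[] args) {
        int x = 1 << 20;                 // single bit at index 20 (high half-word)
        int masked = x & 0x0000FFFF;     // 0: none of the low 16 bits is set
        int m = (masked - 1) >> 31;      // 0 - 1 = -1; arithmetic shift keeps all ones
        int bit = 16 & m;                // 16: index 20 does have its 16s bit set
        System.out.println(bit);         // prints 16

        x = 1 << 5;                      // now a bit in the low half-word
        masked = x & 0x0000FFFF;         // 32: non-zero
        m = (masked - 1) >> 31;          // 31 >> 31 = 0 (sign bit is clear)
        bit = 16 & m;                    // 0: index 5 has no 16s bit
        System.out.println(bit);         // prints 0
    }
}
```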

:o Sweet!
Woz - any chance of a timing test for this version? :wink:

  • Dom

Here you go, Dom.

It takes anywhere between one and two hours to gather the results for each test, so you’ll have to forgive the slight delay.

Some slight code modifications to each bitIndex() were required, as Dom wrote bitIndex6() off the top of his head, and Pamler’s initial version only accepted integers as parameters.

So here are the modified versions:


  public static int bitIndex6(long x)
  {
    int r = 0;
    int a = (int)x;
    int b = (int)(x>>32);

//    if( a == 0 )
//    {
//        a = (int)(x >> 32);
//        r += 32;
//    }

    int c = ( a >> 31) | (( -a ) >> 31);  // 0 if a is 0
    a = (a & c) | (b & (~c));  // a = b if a was 0
    r = 32 & (~c);

    //if ((a & 0xFFFF0000) != 0) r += 16;
    b = a & 0x0000ffff;
    b = ~(( -b ) >> 31);  // 0 if b was non-zero
    r |= 16 & b;

    //if ((a & 0xFF00FF00) != 0) r += 8;
    b = a & 0x00ff00ff;
    b = ~(( -b ) >> 31);  // 0 if b was non-zero
    r |= 8 & b;

    //if ((a & 0xF0F0F0F0) != 0) r += 4;
    b = a & 0x0f0f0f0f;
    b = ~(( -b ) >> 31);  // 0 if b was non-zero
    r |= 4 & b;

    //if ((a & 0xCCCCCCCC) != 0) r += 2;
    b = a & 0x33333333;
    b = ~(( -b ) >> 31);  // 0 if b was non-zero
    r |= 2 & b;

    //if ((a & 0xAAAAAAAA) != 0) r++;
    b = a & 0x55555555;
    b = ~(( -b ) >> 31);  // 0 if b was non-zero
    r |= 1 & b;

    return r;
  }


  public static int bitIndex8(long x)
  {
    int r = 0;
    int a = (int)x;
    int b = (int)(x>>32);
    int c = ( a >> 31) | (( -a ) >> 31);  // 0 if a is 0

    a = (a & c) | (b & (~c));  // a = b if a was 0
    r = 32 & (~c);

    b = ((a & 0x0000FFFF) - 1) >> 31;
    r |= 16 & b;

    b = ((a & 0x00FF00FF) - 1) >> 31;
    r |= 8 & b;

    b = ((a & 0x0F0F0F0F) - 1) >> 31;
    r |= 4 & b;

    b = ((a & 0x33333333) - 1) >> 31;
    r |= 2 & b;

    b = ((a & 0x55555555) - 1) >> 31;
    r |= 1 & b;

    return r;
  }

And a most evil-looking version, in an attempt (somewhat doubtfully) to get the method to inline.


  public static int bitIndex9(long x)
  {
    int a = (int)x;
    int c = (a | -a) >> 31;
    a = (a & c) | (((int)(x >> 32)) & ~c);

    return ((32 & (~c)) | (16 & (((a & 0x0000FFFF) - 1) >> 31)) | (8 & (((a & 0x00FF00FF) - 1) >> 31)) | (4 & (((a & 0x0F0F0F0F) - 1) >> 31)) | (2 & (((a & 0x33333333) - 1) >> 31)) | (1 & (((a & 0x55555555) - 1) >> 31)) );
  }
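For reference, bitIndex9() can be checked over all 64 single-bit longs; a self-contained sketch (the harness and class name are mine, not from the thread):

```java
public class BitIndex9Check {
    // Copy of bitIndex9() above, so this compiles on its own.
    public static int bitIndex9(long x) {
        int a = (int) x;
        int c = (a | -a) >> 31;                  // all ones if the low word is non-zero
        a = (a & c) | (((int) (x >> 32)) & ~c);  // select low or high word
        return ((32 & (~c))
              | (16 & (((a & 0x0000FFFF) - 1) >> 31))
              | (8  & (((a & 0x00FF00FF) - 1) >> 31))
              | (4  & (((a & 0x0F0F0F0F) - 1) >> 31))
              | (2  & (((a & 0x33333333) - 1) >> 31))
              | (1  & (((a & 0x55555555) - 1) >> 31)));
    }

    public static void main(String[] args) {
        // Defined only for longs with exactly one bit set.
        for (int k = 0; k < 64; k++) {
            if (bitIndex9(1L << k) != k)
                throw new AssertionError("failed at bit " + k);
        }
        System.out.println("all 64 single-bit inputs OK");
    }
}
```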

Results in next post.


Woz.

bitIndex6()

[tr]
[td]456.719
457
ibm13index6e.csv
Thread priority = 5
Classic VM
IBM Corporation
1.3.0
1.3.0
IBM Corporation
[/td]

[td]486.462
486
ms114index6e.csv
Thread priority = 5
null
null
null
1.1.4
Microsoft Corp.
[/td]

[td]361.481
361
sun12index6e.csv
Thread priority = 5
Classic VM
Sun Microsystems Inc.
1.2
1.2
Sun Microsystems Inc.
[/td]

[td]591.169
591
sun131index6e.csv
Thread priority = 5
Java HotSpot™ Client VM
Sun Microsystems Inc.
1.3.1-b24
1.3.1
Sun Microsystems Inc.
[/td]

[td]728.77
729
sun141index6e.csv
Thread priority = 5
Java HotSpot™ Client VM
Sun Microsystems Inc.
1.4.1-b21
1.4.1
Sun Microsystems Inc.
[/td]

[td]606.939
607
sun142index6e.csv
Thread priority = 5
Java HotSpot™ Client VM
Sun Microsystems Inc.
1.4.2_01-b06
1.4.2_01
Sun Microsystems Inc.
[/td]
[/tr]

bitIndex8()
[tr]
[td]418.503
418
ibm13index8e.csv
Thread priority = 5
Classic VM
IBM Corporation
1.3.0
1.3.0
IBM Corporation
[/td]

[td]455.988
456
ms114index8e.csv
Thread priority = 5
null
null
null
1.1.4
Microsoft Corp.
[/td]

[td]337.237
337
sun12index8e.csv
Thread priority = 5
Classic VM
Sun Microsystems Inc.
1.2
1.2
Sun Microsystems Inc.
[/td]

[td]567.88
568
sun131index8e.csv
Thread priority = 5
Java HotSpot™ Client VM
Sun Microsystems Inc.
1.3.1-b24
1.3.1
Sun Microsystems Inc.
[/td]

[td]627.207
627
sun141index8e.csv
Thread priority = 5
Java HotSpot™ Client VM
Sun Microsystems Inc.
1.4.1-b21
1.4.1
Sun Microsystems Inc.
[/td]

[td]574.385
574
sun142index8e.csv
Thread priority = 5
Java HotSpot™ Client VM
Sun Microsystems Inc.
1.4.2_01-b06
1.4.2_01
Sun Microsystems Inc.
[/td]
[/tr]
Evil looking bitIndex9()
[tr]
[td]22.033
22
ibm13index9e.csv
Thread priority = 5
Classic VM
IBM Corporation
1.3.0
1.3.0
IBM Corporation
[/td]

[td]463.16
463
ms114index9e.csv
Thread priority = 5
null
null
null
1.1.4
Microsoft Corp.
[/td]

[td]322.418
322
sun12index9e.csv
Thread priority = 5
Classic VM
Sun Microsystems Inc.
1.2
1.2
Sun Microsystems Inc.
[/td]

[td]523.457
523
sun131index9e.csv
Thread priority = 5
Java HotSpot™ Client VM
Sun Microsystems Inc.
1.3.1-b24
1.3.1
Sun Microsystems Inc.
[/td]

[td]513.895
514
sun142index9e.csv
Thread priority = 5
Java HotSpot™ Client VM
Sun Microsystems Inc.
1.4.2_01-b06
1.4.2_01
Sun Microsystems Inc.
[/td]
[/tr]

It would appear that IBM’s JVM optimized bitIndex9() out!

As a whole they appear to be slower (than bitIndex5()) under microbenchmark conditions. Personally, I was expecting the JIT compilers to be considerably slower, due to most JITs being nothing more than an interpreter without the overheads of interpreting (!). Interpreters don’t reorder code; they execute each bytecode as it comes.

All but one JVM is able to beat the original bitIndex2().

Note that the 1.4.2 -server -Xcompile scores are not present, as I have not finished playing with it yet.
I have tested it on bitIndex2, 4 & 5.
Using -Xcompile made no difference to the client JVM, and neither did changing the amount of “warm-up” time.

The server VM, when using just the -server switch, manages to optimize the benchmark out completely, giving a score of zero. When combined with -Xcompile a score is produced; it shows the greatest gain (percentage increase over bitIndex2()), on the order of 190% for bitIndex4(), which is faster than bitIndex5(). But it is using 308 clocks to do it.

Obviously the microbenchmark is going to have to be modified in order to stop the code being optimized out.

It would be interesting to try them on a P4, which has a very long pipeline; there, surely, bitIndex9() would be faster than bitIndex5().

Cold fish all round then!

Woz.

My code is a bit tighter than your bitIndex9…


      public static int getIndex(int v)
      {
            return ((((v & 0x55555555)-1)>>31) & 1)
                 | ((((v & 0x33333333)-1)>>31) & 2)
                 | ((((v & 0x0F0F0F0F)-1)>>31) & 4)
                 | ((((v & 0x00FF00FF)-1)>>31) & 8)
                 | ((((v & 0x0000FFFF)-1)>>31) & 16);
      }

Which you can trivially extend to work on a long instead of an int.

** EDIT ** Actually… hmmm… nevermind…I see what you are doing…

Note: if you run the ‘long’ version on a G5 Mac, it will use 64-bit instructions and registers. Without 64-bit native operations, your ‘a’ and ‘c’ calculations might make a difference. Though I think that can be written a little more efficiently too.
Once you know to OR in 32, you can do something like this and then follow through with the ‘int’ version.


int bitIndex(long v)
{
...
int i = (int)(v | (v >>> 32));
...
do int version on i
}
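Putting the two pieces together, a complete branchless version for a long might look like this (my own assembly of the fold above with the 32-bit mask chain from the earlier post; a sketch, not code from the thread):

```java
public class BitIndexLong {
    // Fold-then-test: OR the high word down into the low word, work out
    // whether the bit was in the high word, then run the 32-bit mask chain.
    public static int bitIndex(long v) {
        int hi = (int) (v >>> 32);         // non-zero iff the bit is in the high word
        int i = (int) (v | (v >>> 32));    // folds the single set bit into 32 bits
        return (32 & ((hi | -hi) >> 31))   // OR in 32 when the high word was non-zero
             | ((((i & 0x55555555) - 1) >> 31) & 1)
             | ((((i & 0x33333333) - 1) >> 31) & 2)
             | ((((i & 0x0F0F0F0F) - 1) >> 31) & 4)
             | ((((i & 0x00FF00FF) - 1) >> 31) & 8)
             | ((((i & 0x0000FFFF) - 1) >> 31) & 16);
    }

    public static void main(String[] args) {
        // Defined only for longs with exactly one bit set.
        for (int k = 0; k < 64; k++)
            if (bitIndex(1L << k) != k)
                throw new AssertionError("failed at bit " + k);
        System.out.println("all 64 single-bit inputs OK");
    }
}
```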

Quick tip for microbenchmarkers:
Just after you calculate your total time, System.out.println() the sum of the results of calling the function (for example), which prevents the methods from being optimized away.
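A minimal sketch of that tip (Long.numberOfTrailingZeros stands in for whichever bitIndexN() is under test; all names here are illustrative, not from the thread):

```java
public class SinkBench {
    public static void main(String[] args) {
        long sum = 0;
        long t0 = System.currentTimeMillis();
        for (int rep = 0; rep < 100000; rep++)
            for (int k = 0; k < 64; k++)
                sum += Long.numberOfTrailingZeros(1L << k);  // stand-in for bitIndexN()
        long elapsed = System.currentTimeMillis() - t0;
        // Printing the accumulated sum makes the result observable,
        // so the JIT cannot dead-code-eliminate the loop body.
        System.out.println("sum = " + sum + " (" + elapsed + " ms)");
    }
}
```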

Cas :slight_smile:

Woz: Cheers for the timings (& the fixes) :slight_smile:
Shame they’re not any faster :-/ Interesting, though. I wonder if it’s not managing to interleave the dependency chains for each output bit.

  • Dom

What does the test harness look like?
You would think that the branch prediction of modern processors would be totally defeated by this stuff, since each branch has exactly a 50:50 chance of being taken.
I would like to try this on the Mac to see if the PowerPC handles things differently.

Would be interesting to see results from a Crusoe processor running a similar test :slight_smile: