Fastest get2fold algorithm

swpalmer · March 26, 2006, 4:05am

The conditions are wrong. The above fails for 0x101.

Riven · March 26, 2006, 8:58am

And then, you don’t need any ‘else’ statements… just

if(…)
return …;
if(…)
return …;
if(…)
return …;
if(…)
return …;

DzzD · March 26, 2006, 12:54pm

you are right, I think about it while writing the post, so there could be other mistake, anyways it looks to be a fast algo, have you tried to benchmark this code?

swpalmer · March 26, 2006, 6:42pm

I tried a few algorithms… one was branchless but the equation was too large and it performed very poor compared to the while (1 << i) loop version.

Here’s another attempt that is more like a divide & conquer approach, it does 25-30% better than the while loop for me:


	public static int p2Bound4(int n) {
		int h1 =  n & 0xffff0000;  // 11111111111111110000000000000000
		if(h1 == 0)
			h1 = n & ~0xffff0000;
		
		int h2 = h1 & 0xff00ff00;  // 11111111000000001111111100000000
		if (h2 == 0)
			h2 = h1 & ~0xff00ff00;
		
		int h3 = h2 & 0xf0f0f0f0;  // 11110000111100001111000011110000
		if (h3 == 0)
			h3 = h2 & ~0xf0f0f0f0;
		
		int h4 = h3 & 0xcccccccc;  // 11001100110011001100110011001100
		if (h4 == 0)
			h4 = (h3 & ~0xcccccccc);
		
		int h5 = h4 & 0xaaaaaaaa;  // 10101010101010101010101010101010
		if (h5 == 0)
			h5 = (h4 & ~0xaaaaaaaa);
		
		// h5 is the highest set bit in 'n', so we likely need to shift by 1
		return h5 << (h5-n >>> 31);
	}

Color_Of_Green · March 27, 2006, 2:04am

Excluding the raw lookup table approach, the p2Bound4 algorithm is the fastest so far, by about 20%… nice.

mabraham · March 27, 2006, 8:01pm

Hey guys,

This is what I use:


   private static int nextPow2( int value )
   {
      if ( ( value & ( value - 1 ) ) == 0 )
      {
         // already a power-of-2 value
         return value;
      }
      return Integer.highestOneBit( value ) << 1;
   }

It uses new 1.5 API, it uses no branches (the Integer.highestOneBit() API doesn’t anyway), and I believe it should perform pretty well, though I haven’t benchmarked it yet.

Color_Of_Green · March 28, 2006, 1:12am

Interesting, on my centrino laptop the BS method is actually 2x faster than p2bound4 but on my P4 its slower??

HamsterofDeath · March 31, 2006, 9:39am

   public static final int get2FoldUnrolled(int n)
   {
      return n < 0
         ? 0xFFFFFFFF
         : n > 0x00010000
            ? n > 0x01000000
            ? n > 0x10000000
            ? n > 0x40000000
            ? 0xFFFFFFFF
            : (n > 0x20000000) ? 0x40000000 : 0x20000000
            : n > 0x04000000 ? (n > 0x08000000) ? 0x10000000 : 0x08000000 : (n > 0x02000000) ? 0x04000000 : 0x02000000
            : n > 0x00100000
               ? n > 0x00400000
               ? (n > 0x00800000) ? 0x01000000 : 0x00800000
               : (n > 0x00200000) ? 0x00400000 : 0x00200000
               : n > 0x00040000
                  ? (n > 0x00080000) ? 0x00100000 : 0x00080000
                  : (n > 0x00020000) ? 0x00040000 : 0x00020000
            : n > 0x00000100 ? n > 0x00001000
               ? n > 0x00004000
               ? (n > 0x00008000) ? 0x00010000 : 0x00008000
               : (n > 0x00002000) ? 0x00004000 : 0x00002000
               : n > 0x00000400
                  ? (n > 0x00000800) ? 0x00001000 : 0x00000800
                  : (n > 0x00000200) ? 0x00000400 : 0x00000200 : n > 0x00000010 ? n > 0x00000040 ?
               (n >
                0x00000080)
                  ?
                  0x00000100
                  :
                     0x00000080 :
               (n >
                0x00000020)
                  ?
                  0x00000040
                  :
                     0x00000020 :
               n >
               0x00000004
                  ?
                  (n >
                   0x00000008)
                     ?
                     0x00000010
                     :
                        0x00000008
                  :
                     (n >
                      0x00000002)
                        ?
                        0x00000004
                        :
                           n;

   }

scnr

oravecz · April 3, 2006, 1:10pm

On my machine, some of the more optimized solutions are clocking in at less than 100 ms for 5,000,000 iterations. Any of the algorithms presented thus far are suitable replacements for the java.lang.Math approach, but it is fun to see the different approaches to the problem. In the back of your mind, you begin to feel that there is an extremely elegant and simple solution to this problem, and the highestOneBit algorithm probably comes closest.

For the sake of summary, I have collected all of the algorithms mentioned so far into a single test harness that verifies their accuracy and performs a micro-benchmark. As swpalmer noted, the lookup code produces inaccurate results. This may be caused by my initialization of the lookup table.

Since this is a micro-benchmark, take it with a grain of salt. I would expect some algorithms to be faster than others and vice-versa when run on different OS, processor, ot Java VM. (Actually, I wouldn’t expect this, but it turns out that way. Maybe it is just the VM that makes all of the difference.)

Here are the results of my tests:

Windows P4 2.8GHz
java version “1.4.2_10”
Java™ 2 Runtime Environment, Standard Edition (build 1.4.2_10-b03)
Java HotSpot™ Client VM (build 1.4.2_10-b03, mixed mode)

>java -cp . PowerOfTwo
-- Accuracy run
All methods return the exact same results.
-- Test runs with 5000000 iterations.
Math.log method: 6219, 6453, 6453, 6469, 6453
Rolled out binary search method: 94, 94, 93, 93, 94
Shifty method: 281, 313, 313, 313, 312
Logical method: 156, 156, 156, 141, 141
p2Bound4 method: 141, 140, 156, 141, 141
highestOneBit method: 156, 141, 141, 125, 140
get2FoldFlattened method: 93, 94, 94, 94, 94
shiftUp method: 235, 250, 250, 234, 250

The 1.4 VM shows the Math.log code more than a magnitude slower than some of the optimized routines, as one might expect. The BS approach, that started this thread, by Dirk Bosmans and Vincent Vollers still performs the best in this configuration. The approaches that use some type of bit-shifting are close behind.

Windows P4 2.8GHz
java version “1.4.2_10”
Java™ 2 Runtime Environment, Standard Edition (build 1.4.2_10-b03)
Java HotSpot™ Server VM (build 1.4.2_10-b03, mixed mode)

>java -cp . -server PowerOfTwo
-- Accuracy run
All methods return the exact same results.
-- Test runs with 5000000 iterations.
Math.log method: 2953, 2891, 2953m 2907, 2922
Rolled out binary search method: 79, 78, 78, 78, 63
Shifty method: 250, 266, 265, 265, 265
Logical method: 156, 140, 141, 157, 141
p2Bound4 method: 78, 78, 78, 62, 78
highestOneBit method: 78, 78, 78, 94, 78
get2FoldFlattened method: 78, 79, 78, 62, 78
shiftUp method: 210, 210, 211, 210, 210

I took a moment to run the VM in server mode. I wasn’t expecting to see much change since we are only performing such minimal computations. I was wrong. Perhaps there is much more aggressive inlining occurring? I’m not sure, but all of the benchmarks were reduced, and significantly in the case of the java.lang.Math functions. Also, the p2Bound4 and highestOneBit algorithms now match the performance of the BS approaches.

Windows P4 2.8GHz
java version “1.6.0-beta”
Java™ 2 Runtime Environment, Standard Edition (build 1.6.0-beta-b59g)
Java HotSpot™ Client VM (build 1.6.0-beta-b59g, mixed mode, sharing)

>java -cp . PowerOfTwo
-- Accuracy run
All methods return the exact same results.
-- Test runs with 5000000 iterations.
Math.log method: 4329, 4328, 4375, 4390, 4391
Rolled out binary search method: 78, 78, 78, 63, 78
Shifty method: 218, 219, 234, 219, 219
Logical method: 172, 172, 172, 172, 156
p2Bound4 method: 78, 78, 78, 78, 78
highestOneBit method: 79, 78, 79, 78, 78
get2FoldFlattened method: 78, 78, 78, 78, 78
shiftUp method: 220, 215, 214, 214, 210

Running the benchmarks under the Java 6 beta produced similar results as the server mode results above, but with not so great a performance bump for the java.lang.Math routine. There is no server VM for my beta JDK download, so that test is not included.

IBM p590 Server
java version “1.4.2”
Java™ 2 Runtime Environment, Standard Edition (build 1.4.2)
Classic VM (build 1.4.2, J2RE 1.4.2 IBM AIX build ca142ifx-20050119 SR1+80507+81622 (JIT enabled: jitc))

> /usr/java14/bin/java -cp . PowerOfTwo
-- Accuracy run
All methods return the exact same results.
-- Test runs with 5000000 iterations.
Math.log method: 1271, 1269, 1240, 1212, 1208
Rolled out binary search method: 98, 96, 97, 96, 96
Shifty method: 0, 0, 0, 0, 0
Logical method: 325, 321, 324, 327, 324
p2Bound4 method: 87, 85, 87, 86, 86
highestOneBit method: 0, 0, 0, 0, 0
get2FoldFlattened method: 112, 110, 110, 110, 111
shiftUp method: 0, 0, 0, 0, 0

This is the test result for my production server environment. This hardware accelerates the simple bit-shifting algorithms (with minimal looping) to the point that I couldn’t measure them. Shifty, highestOneBit, and ShiftUp executed 5MM iterations in less than 10 ms! That’s quite a bit of optimization. The same hardware produced some of the lowest scores for the BS routines.

oravecz · April 3, 2006, 1:12pm

Here is the test harness. I think I have everyone’s tweaks in here, but I have commented out the lookup code because of verification errors. If someone can correct this and repost, it would be helpful. I copied the code from the Java 5 Integer.highestOneBit() method so it reduces a method call and also can run in JDK 1.4 VMs.

public class PowerOfTwo {


	public static final double invln2 = 1.0 / Math.log(2.0);

	/**
	 * @author Control
	 */
	public static final int get2FoldLog(int num) {
		return (int) Math.pow(2, Math.ceil(Math.log(num) * invln2));
	}

	/**
	 * @author Dirk Bosmans Dirk.Bosmans@tijd.com
	 * @author Vincent Vollers
	 */
	public static final int get2FoldUnrolled(int n) {
		if (n < 0) return 32;
		if (n > 0x00010000) {
			if (n > 0x01000000) {
				if (n > 0x10000000) {
					if (n > 0x40000000) {
						// if ( n > 0x7fffffff )
						// return 32
						// else
						// return 2147483648; is too much, so return -1
						return -1;
					} else {
						// !( n > 0x3fffffff )
						if (n > 0x20000000) return 1073741824;
						else return 536870912;
					}
				} else {
					// !( n > 0x0fffffff )
					if (n > 0x04000000) {
						if (n > 0x08000000) return 268435456;
						else return 134217728;
					} else {
						// !( n > 0x03ffffff )
						if (n > 0x02000000) return 67108864;
						else return 33554432;
					}
				}
			} else {
				// !( n > 0x00ffffff )
				if (n > 0x00100000) {
					if (n > 0x00400000) {
						if (n > 0x00800000) return 16777216;
						else return 8388608;
					} else {
						// !( n > 0x003fffff )
						if (n > 0x00200000) return 4194304;
						else return 2097152;
					}
				} else {
					// !( n > 0x000fffff )
					if (n > 0x00040000) {
						if (n > 0x00080000) return 1048576;
						else return 524288;
					} else {
						// !( n > 0x0003ffff )
						if (n > 0x00020000) return 262144;
						else return 131072;
					}
				}
			}
		} else {
			// !( n > 0x0000ffff )
			if (n > 0x00000100) {
				if (n > 0x00001000) {
					if (n > 0x00004000) {
						if (n > 0x00008000) return 65536;
						else return 32768;
					} else {
						// !( n > 0x00003fff )
						if (n > 0x00002000) return 16384;
						else return 8192;
					}
				} else {
					// !( n > 0x00000fff )
					if (n > 0x00000400) {
						if (n > 0x00000800) return 4096;
						else return 2048;
					} else {
						// !( n > 0x000003ff )
						if (n > 0x00000200) return 1024;
						else return 512;
					}
				}
			} else {
				// !( n > 0x000000ff )
				if (n > 0x00000010) {
					if (n > 0x00000040) {
						if (n > 0x00000080) return 256;
						else return 128;
					} else {
						// !( n > 0x0000003f )
						if (n > 0x00000020) return 64;
						else return 32;
					}
				} else {
					// !( n > 0x0000000f )
					if (n > 0x00000004) {
						if (n > 0x00000008) return 16;
						else return 8;
					} else {
						// !( n > 0x00000003 )
						if (n > 0x00000002) return 4;

						return n;
					}
				}
			}
		}
	} // end widthInBits

	/**
	 * @author James Cook
	 */
	public static final int get2FoldShifty(int n) {
		if (n == 0) return 0;
		int power = 0;
		n--;
		while (n > 0) {
			n = n >> 1;
			power++;
		}
		return 1 << power;
	}

	/**
	 * @author James Cook
	 */
	public static final int get2FoldLogical(int n) {
		int maxPower = 0x40000000;
		while (maxPower > 0) {
			if (n == maxPower) return maxPower;
			if (n > maxPower) return maxPower << 1;
			maxPower = maxPower >> 1;
		}
		return 0;
	}

	/**
	 * @author swpalmer
	 */
	public static final int p2Bound4(int n) {
		int h1 = n & 0xffff0000;  // 11111111111111110000000000000000
		if (h1 == 0)
			h1 = n & ~0xffff0000;

		int h2 = h1 & 0xff00ff00;  // 11111111000000001111111100000000
		if (h2 == 0)
			h2 = h1 & ~0xff00ff00;

		int h3 = h2 & 0xf0f0f0f0;  // 11110000111100001111000011110000
		if (h3 == 0)
			h3 = h2 & ~0xf0f0f0f0;

		int h4 = h3 & 0xcccccccc;  // 11001100110011001100110011001100
		if (h4 == 0)
			h4 = (h3 & ~0xcccccccc);

		int h5 = h4 & 0xaaaaaaaa;  // 10101010101010101010101010101010
		if (h5 == 0)
			h5 = (h4 & ~0xaaaaaaaa);

		// h5 is the highest set bit in 'n', so we likely need to shift by 1
		return h5 << (h5 - n >>> 31);
	}

	/**
	 * @author JDK5 Integer class by way of mabraham ("inlined" the code from Integer)
	 */
	private static final int highestOneBit(int n) {
		// already a power-of-2 value
		if ((n & (n - 1)) == 0) return n;
		n |= (n >> 1);
		n |= (n >> 2);
		n |= (n >> 4);
		n |= (n >> 8);
		n |= (n >> 16);
		n = n - (n >>> 1);
		return n << 1;
	}

	public static final int[] lookup = new int[256];

	static {
		for (int i = 0; i < lookup.length; i++)
			lookup[i] = highestOneBit(i);
	}

	public static final int lookup(int val) {
		if (val < 0)
			return -1;
		else if (val < 256)
			return lookup[val];
		else if (val < 65536)
			return lookup[val >> 8] << 8;
		else if (val < 16777216)
			return lookup[val >> 16] << 16;
		else if (val < 2147483647)
			return lookup[val >> 24] << 24;
		return -1;
	}


	/**
	 * @author HamsterOfDeath
	 */
	public static final int get2FoldFlattened(int n) {
		return n < 0
				? 0xFFFFFFFF
				: n > 0x00010000
				? n > 0x01000000
				? n > 0x10000000
				? n > 0x40000000
				? 0xFFFFFFFF
				: (n > 0x20000000) ? 0x40000000 : 0x20000000
				: n > 0x04000000 ? (n > 0x08000000) ? 0x10000000 : 0x08000000 : (n > 0x02000000) ? 0x04000000 : 0x02000000
				: n > 0x00100000
				? n > 0x00400000
				? (n > 0x00800000) ? 0x01000000 : 0x00800000
				: (n > 0x00200000) ? 0x00400000 : 0x00200000
				: n > 0x00040000
				? (n > 0x00080000) ? 0x00100000 : 0x00080000
				: (n > 0x00020000) ? 0x00040000 : 0x00020000
				: n > 0x00000100 ? n > 0x00001000
				? n > 0x00004000
				? (n > 0x00008000) ? 0x00010000 : 0x00008000
				: (n > 0x00002000) ? 0x00004000 : 0x00002000
				: n > 0x00000400
				? (n > 0x00000800) ? 0x00001000 : 0x00000800
				: (n > 0x00000200) ? 0x00000400 : 0x00000200 : n > 0x00000010 ? n > 0x00000040 ?
				(n >
						0x00000080)
						?
						0x00000100
						:
						0x00000080 :
				(n >
						0x00000020)
						?
						0x00000040
						:
						0x00000020 :
				n >
						0x00000004
						?
						(n >
								0x00000008)
								?
								0x00000010
								:
								0x00000008
						:
						(n >
								0x00000002)
								?
								0x00000004
								:
								n;

	}

	public static final int shiftUp(int n) {
		if (n == 0) return 0;
		int result = 1;
		while (result < n) result <<= 1;
		return result;
	}


	public void testPerformance(int run) {
		System.out.println("-- Test run " + run + " with " + NUM + " iterations.");

		long time = System.currentTimeMillis();
		for (int i = 0; i < NUM; i++) get2FoldLog(i);
		time = System.currentTimeMillis() - time;

		System.out.println("Math.log method: " + time);

		time = System.currentTimeMillis();
		for (int i = 0; i < NUM; i++) get2FoldUnrolled(i);
		time = System.currentTimeMillis() - time;

		System.out.println("Rolled out binary search method: " + time);

		time = System.currentTimeMillis();
		for (int i = 0; i < NUM; i++) get2FoldShifty(i);
		time = System.currentTimeMillis() - time;

		System.out.println("Shifty method: " + time);

		time = System.currentTimeMillis();
		for (int i = 0; i < NUM; i++) get2FoldLogical(i);
		time = System.currentTimeMillis() - time;

		System.out.println("Logical method: " + time);

		time = System.currentTimeMillis();
		for (int i = 0; i < NUM; i++) p2Bound4(i);
		time = System.currentTimeMillis() - time;

		System.out.println("p2Bound4 method: " + time);

		time = System.currentTimeMillis();
		for (int i = 0; i < NUM; i++) highestOneBit(i);
		time = System.currentTimeMillis() - time;

		System.out.println("highestOneBit method: " + time);

//		time = System.currentTimeMillis();
//		for (int i = 0; i < NUM; i++) lookup(i);
//		time = System.currentTimeMillis() - time;
//
//		System.out.println("lookup method: " + time);

		time = System.currentTimeMillis();
		for (int i = 0; i < NUM; i++) get2FoldFlattened(i);
		time = System.currentTimeMillis() - time;

		System.out.println("get2FoldFlattened method: " + time);

		time = System.currentTimeMillis();
		for (int i = 0; i < NUM; i++) shiftUp(i);
		time = System.currentTimeMillis() - time;

		System.out.println("shiftUp method: " + time);
	}

	public void testAccuracy() {
		System.out.println("-- Accuracy run ");

		for (int i = 1; i < 0x4000000; i++) {
			int i1 = get2FoldLog(i);
			int i2 = get2FoldUnrolled(i);
			int i3 = get2FoldShifty(i);
			int i4 = get2FoldLogical(i);
			int i5 = p2Bound4(i);
			int i6 = highestOneBit(i);
			int i7 = get2FoldFlattened(i);
			int i8 = shiftUp(i);

			if (i1 != i2 || i2 != i3 || i3 != i4 ||
					i4 != i5 || i5 != i6 || i6 != i7 || i7 != i8) {
				System.err.println("Error in results for i = " + i +
						", Log:" + i1 + ", Unrolled: " + i2 +
						", Shifty: " + i3 + ", Logical: " + i4 +
						", p2Bound4: " + i5 + ", highestOneBit: " + i6 +
						", get2FoldFlattened: " + i7 + ", shiftUp: " + i8);
				System.err.println("Stopping test. (there may be other inaccuracies.)");
				System.exit(-1);
			}
		}
		System.out.println("All methods return the exact same results.");
	}

	public PowerOfTwo() {
		testAccuracy();
		for (int i = 0; i < 5; i++) {
			testPerformance(i + 1);
		}
	}

	public static int NUM = 5000000;

	public static void main(String[] args) {
		new PowerOfTwo();
	}
}

Riven · April 3, 2006, 6:48pm

Ouch. System.currentTimeMillis() is way to inaccurate for benchmark-results under 100ms (as it might snap to 15ms values).

Java 1.4 has sun.misc.Perf to do what System.nanoTime does in 1.5+

DzzD · April 3, 2006, 7:20pm

it is true that System.currentTimeMillis() can have about 16ms error, but, if the complete test is lasting enough time,the final result will have a very good precision.

example: if the full loop lasts more than 5 seconds and your are testing 1000000 values you will only have the following error precision per op: EDIT not homogen 16ms / (5*1000000) = 3.2 ns max error

EDIT:
Total time error : 0,016/5 = 0,32%

EDIT2 false again
per op error : 0,32%/nbop=0,32%/1000000= 0,00000032%

per op error : 0,32%

conclusion making a bench with enought values will give the same result with System.currentMillis than any other precision timer ;-), (because the arror of System.currentMillis does not grow with time)

Riven · April 3, 2006, 7:34pm

Yes, but the loop takes less than 70ms…

DzzD · April 3, 2006, 7:43pm

ok, maybe it is too short even with good timer

@oravecz : thanks for developping this full bench with all differents methods

DzzD · April 3, 2006, 8:01pm

	public static final int lookup(int val) 
	{
		if (val < 65536)
		{
			if (val < 256)
			{
				if(val<0)
					return -1;
				else
					return lookup[val];
			}
			else
				return lookup[val >> 8] << 8;
		}
		else
		{
			if (val < 16777216)
				return lookup[val >> 16] << 16;
			else
				return lookup[val >> 24] << 24;
		}
	}

review of the lookup method with a better implementation of if/else statements , seems to give better results

Riven · April 3, 2006, 9:51pm

You can logicly remove a lot of ‘else’ statements from that code:


public static final int lookup(int val) 
{
   if (val < 65536)
   {
      if (val < 256)
      {
         if(val<0)
            return -1;
         return lookup[val];
      }
      return lookup[val >> 8] << 8;
   }

   if (val < 16777216)
      return lookup[val >> 16] << 16;
   return lookup[val >> 24] << 24;
}

Very tempting to inline one if/else statement:


public static final int lookup(int val) 
{
   if (val < 65536)
   {
      if (val < 256)
         return (val<0) -1 ? lookup[val];
      return lookup[val >> 8] << 8;
   }

   if (val < 16777216)
      return lookup[val >> 16] << 16;
   return lookup[val >> 24] << 24;
}

But probably just as fast

DzzD · April 3, 2006, 11:32pm

maybe but your code looks much cleaner,I am really supid to put a else statement after a return

Riven · April 3, 2006, 11:49pm

You can setup Eclipse to whine about it