FastMath: hypot(int,int)

Riven · March 12, 2013, 7:33am

Using a lookup table it is possible to vastly outperform Math.hypot(x,y), as long as the input values are integers and they are in a limited range (in this case: -512…+511). Within this range, we observe a factor 40 speedup and an accuracy of 100%, while outside of this range, it’s still a respectable factor 4, with a worst case accuracy of 98%.

This code is used to speedup A* cost heuristic.


public class FastMath
{
	private static final int HYPOT_LUT_SHIFT = 9; // 2^9=512
	private static final int HYPOT_LUT_MAX_RANGE = 1 << HYPOT_LUT_SHIFT;
	private static final float[] HYPOT_LUT;
	private static final int NARROW_DOWN_MUL = 9;
	private static final int NARROW_DOWN_DIV = NARROW_DOWN_MUL + 1;
	private static final float NARROW_DOWN_FACTOR = (float) NARROW_DOWN_DIV / NARROW_DOWN_MUL;

	static {
		class Helper {
			public float[] compute() {
				float[] arr = new float[HYPOT_LUT_MAX_RANGE * HYPOT_LUT_MAX_RANGE];
				EasyMath.init(arr);
				return arr;
			}
		}

		long t0 = System.nanoTime();
		HYPOT_LUT = new Helper().compute();
		long t1 = System.nanoTime();
		System.out.println("EasyMath.LUT calculation took " + ((t1 - t0) / 1000000) + "ms");

		//verify();

		//bench();
	}

	public static float hypot(int dx, int dy) {
		int ix = Math.abs(dx);
		int iy = Math.abs(dy);

		float mul = 1.0f;
		while (((ix | iy) >> HYPOT_LUT_SHIFT) > 0) {
			// narrow down, reasonably accurately
			ix = ix * NARROW_DOWN_MUL / NARROW_DOWN_DIV;
			iy = iy * NARROW_DOWN_MUL / NARROW_DOWN_DIV;
			mul *= NARROW_DOWN_FACTOR;
		}
		return HYPOT_LUT[(iy << HYPOT_LUT_SHIFT) | ix] * mul;
	}

	private static void init(final float[] lut) {
		if (false) {
			WorkerPool.distibute((Object[]) null, HYPOT_LUT_MAX_RANGE, WorkerPool.THREAD_COUNT, new WorkerPool.TaskArrayOperator<Object>() {
				@Override
				public void operate(int taskid, Object[] data, int off, int end) {
					for (int dx = off; dx < end; dx++) {
						for (int dy = 0; dy < HYPOT_LUT_MAX_RANGE; dy++) {
							lut[(dy << HYPOT_LUT_SHIFT) | dx] = (float) Math.hypot(dx, dy);
						}
					}
				}
			});
		} else {
			for (int dx = 0; dx < HYPOT_LUT_MAX_RANGE; dx++) {
				for (int dy = 0; dy < HYPOT_LUT_MAX_RANGE; dy++) {
					lut[(dy << HYPOT_LUT_SHIFT) | dx] = (float) Math.hypot(dx, dy);
				}
			}
		}
	}

	private static void verify() {
		final int ext = 3;

		for (int dx = 0; dx < HYPOT_LUT_MAX_RANGE; dx++) {
			for (int dy = 0; dy < HYPOT_LUT_MAX_RANGE; dy++) {
				float d1 = hypot(dx, dy);
				float d2 = (float) Math.hypot(dx, dy);
				if (d1 != d2) {
					throw new IllegalStateException(dx + "," + dy + " --> " + d1 + ", " + d2);
				}
			}
		}

		for (int dx = -HYPOT_LUT_MAX_RANGE + 1; dx < HYPOT_LUT_MAX_RANGE; dx++) {
			for (int dy = -HYPOT_LUT_MAX_RANGE + 1; dy < HYPOT_LUT_MAX_RANGE; dy++) {
				float d1 = hypot(dx, dy);
				float d2 = (float) Math.hypot(dx, dy);
				if (d1 != d2) {
					throw new IllegalStateException(dx + "," + dy + " --> " + d1 + ", " + d2);
				}
			}
		}

		float maxError = 0.98f;
		for (int dx = -HYPOT_LUT_MAX_RANGE * ext; dx < HYPOT_LUT_MAX_RANGE * ext; dx++) {
			for (int dy = -HYPOT_LUT_MAX_RANGE * ext; dy < HYPOT_LUT_MAX_RANGE * ext; dy++) {
				float d1 = hypot(dx, dy);
				float d2 = (float) Math.hypot(dx, dy);
				if (d1 / d2 < maxError || d2 / d1 < maxError) {
					System.err.println(dx + "," + dy + " --> " + d1 + ", " + d2 + " (" + d1 / d2 + ")");
				}
			}
			System.out.println("dx=" + dx);
		}
	}

	private static void bench() {

		final int ext = 3;

		float sum0 = 0.0f;
		float sum1 = 0.0f;
		float sum2 = 0.0f;

		int ops0 = 0;
		int ops1 = 0;
		int ops2 = 0;

		System.out.println("-");
		long t0 = System.nanoTime();
		for (int dx = -HYPOT_LUT_MAX_RANGE * ext; dx < HYPOT_LUT_MAX_RANGE * ext; dx++) {
			for (int dy = -HYPOT_LUT_MAX_RANGE * ext; dy < HYPOT_LUT_MAX_RANGE * ext; dy++) {
				sum0 += hypot(dx, dy);
				ops0 += 1;
			}
		}
		System.out.println("a");
		long t1 = System.nanoTime();
		for (int dx = -HYPOT_LUT_MAX_RANGE + 1; dx < HYPOT_LUT_MAX_RANGE; dx++) {
			for (int dy = -HYPOT_LUT_MAX_RANGE + 1; dy < HYPOT_LUT_MAX_RANGE; dy++) {
				sum1 += hypot(dx, dy);
				ops1 += 1;
			}
		}
		System.out.println("b");
		long t2 = System.nanoTime();
		for (int dx = -HYPOT_LUT_MAX_RANGE + 1; dx < HYPOT_LUT_MAX_RANGE; dx++) {
			for (int dy = -HYPOT_LUT_MAX_RANGE + 1; dy < HYPOT_LUT_MAX_RANGE; dy++) {
				sum2 += (float) Math.hypot(dx, dy);
				ops2 += 1;
			}
		}
		System.out.println("c");
		long t3 = System.nanoTime();

		System.out.println("EasyMath.LUT out of range: " + (int) (ops0 / ((t1 - t0) / 1000000.0)) + "K ops/sec [" + sum0 + "]");
		System.out.println("EasyMath.LUT in range: " + (int) (ops1 / ((t2 - t1) / 1000000.0)) + "K ops/sec [" + sum1 + "]");
		System.out.println("EasyMath.LUT native: " + (int) (ops2 / ((t3 - t2) / 1000000.0)) + "K ops/sec [" + sum2 + "]");
	}
}


EasyMath.LUT calculation took 65ms
EasyMath.LUT out of range:   4861K ops/sec [1.10062418E10]
EasyMath.LUT in range:      57928K ops/sec [4.09633024E8]
EasyMath.LUT native:         1432K ops/sec [4.09633024E8]

princec · March 12, 2013, 8:44am

By way of a little thinking about this optimisation -

We were running our Battledroid sim nonvisually, flat out. In the sim there are 1,000 G101 battledroids looking for each other on a 256x256 tile map, and blasting at each other with Standard Issue Puppygames Ordinary Blasters. The sim would run in approximately 12 seconds or so before a victorious side was declared. When put through the profiler we discovered an astonishing 75% of CPU time was taken up with Math.hypot(); about a third of which was just some foolish range checking code that was using hypot() when a squared distance would have done and easily optimised away, but the other two thirds was used by the getDistance() method of our A* pathfinder which the droids use to find each other.

Math.hypot() has several built-in disadvantages compared to a more specialised version. Firstly, it has to operate with double-precision accuracy, and it’s running using strictfp here on my Java7 Server VM, due to its specifications (this is good; we need absolutely deterministic behaviour between VMs). Secondly, it has to operate on a totally arbitrary range of double-precision floating point inputs, which means it can’t take any shortcuts at all and needs to work everything out properly every time it’s called.

Our requirements are much more simple. We only use integer coordinates, and we’ve got a pretty restricted range of possible input values as well. It is this that allows us to use a (pretty massive) look up table (in our case we’re using a 16mb table!) Even waiting for RAM to page in and out we’re still vastly quicker than waiting for the extremely accurate Math.hypot() to calculate.

Finally I’ve got a suspicion that from tick to tick the sequence of values passed in to the hypot() function change very little and tend to be clustered. My suspicions lead me to think that we’re actually not thrashing our L2 cache that much at all to keep all the relevant pages in RAM. I wonder if there is a tool we can use that will show us exactly how we’re hammering the caches. Anyway, that accounts for the 40x speedup in most cases, I think.

The same sim now runs in approximately 3 seconds.

Cas

Mike · March 12, 2013, 8:50am

Did you try it with the Apache FastMath library as well? I kept running into hypot/pow performance issues as well and that one pretty much solved all my issues. It might not be 40 times as quick but maybe it’d beat the 4 times out of the range and wouldn’t require the 16mb.

Nevertheless, another piece of nice code Riv

Mike

delt0r · March 12, 2013, 8:51am

Good part of java Math performance is also the requirement of being accurate to 2ulp. This costs quite a bit and equivalent C would require a few compiler flags to be compliant.

Thou for 2d stuff I wrote a quick and dirty game that used complex numbers for angles. This meant i never ever needed to call a any trig function at all and almost never a sqrt. I also had a all integer version.

Riven · March 12, 2013, 8:53am

Apache FastMath.hypot(x,y) uses Math.sqrt(n)

Mike · March 12, 2013, 8:55am

Indeed, apparently Math.sqrt is quicker than Math.hypot

Mike

Riven · March 12, 2013, 8:55am

We don’t use hypot for angles.

Riven · March 12, 2013, 9:15am


EasyMath.hypot:  57928K ops/sec [4.09633024E8]
Math.hypot:       1432K ops/sec [4.09633024E8]
Math.sqrt:       65652K ops/sec [4.09633024E8]

:emo:


public class FastMath
{
	public static float hypot(float dx, float dy) {
		return (float) Math.sqrt(dx * dx + dy * dy);
	}
}

sigh

princec · March 12, 2013, 9:38am

Worth pointing out we used hypot() in the first place because of our need for strict accuracy between VMs Turns out it’s a bit too accurate for its own good.

Cas

Mike · March 12, 2013, 9:38am

Sorry Riv, I still think you’re awesome though! :point:

Mike

delt0r · March 12, 2013, 9:47am

The whole point of hypot is to be more accurate than Math.sqrt(xx+yy) and to avoid overflow. IIRC by internally using something like a=y/x;Math.abs(x)Math.sqrt(1+aa). It is not suppose to be fast.

My point with the complex number thing was that avoid normalization or hypot functions a lot as well. Thou its never been a bottle neck anyway. Integers was for lock step multiplayer. Not convinced just using doubles wouldn’t be easier tbh.

Mike · March 12, 2013, 9:49am

The FastMath Library by Apache has some code in their hypot to avoid overflow. On the other hand it isn’t as accurate indeed, but I can recommend it to everyone here as it speeds things up a lot.

Mike

Roquen · March 12, 2013, 10:24am

Overflow AND underflow…neither of which are a concern here. hypot is one of those functions with a high stink factor.