13.8x faster atan2 (updated)

Both are not affected equally, for the following reason:

Let’s say that A is 10x as fast as B (A takes 100ms, B takes 1000ms).

Now we’re going to add an overhead of 250ms (let’s call it ‘random number calculation overhead’):
benchmark of A = 100ms + 250ms = 350ms
benchmark of B = 1000ms + 250ms = 1250ms
-> conclusion: “A is 3.57x as fast as B”

Let’s instead add an overhead of 100ms (let’s call it ‘array lookup overhead’):
benchmark of A = 100ms + 100ms = 200ms
benchmark of B = 1000ms + 100ms = 1100ms
-> conclusion: “A is 5.50x as fast as B”

Finally, add an overhead of 10000ms:
benchmark of A = 100ms + 10000ms = 10100ms
benchmark of B = 1000ms + 10000ms = 11000ms
-> conclusion: “A and B are about equally fast”

The magnitude of the overhead matters; it can’t be treated as a factor that is ‘billed equally’ to both.

Oh nuts, you’re right. I would have to run a control and subtract it from both: subtracting the 250ms control gives (350 - 250) = 100ms and (1250 - 250) = 1000ms, recovering the true 10x factor.

It’s funny how we’re solving complex logic problems, and once in a while screw up with the basics :slight_smile: :-*

In the past I have found a very elegant and reliable way to do microbenchmarking while compensating for the effects of overhead.

Time one loop without the function being tested and one loop with it; the difference between the two isolates the function’s cost.

	public static void main(String[] args)
	{
		int samples = 1000000;
		long l1;
		long l2;
		int n = 0;

		for (int tries = 0; tries < 10; tries++)
		{
			// empty loop: measures the pure loop overhead
			l1 = System.nanoTime();
			for (int i = 0; i < samples; i++)
			{}
			l1 = System.nanoTime() - l1;

			// same loop, plus the operation under test (n++)
			l2 = System.nanoTime();
			for (int i = 0; i < samples; i++)
			{
				n++;
			}
			l2 = System.nanoTime() - l2;

			System.out.println("t0 = " + l1 + "\tt=" + l2 + "\tspeed=" + samples * 1e9f / (l2 - l1));
		}
	}

Result:
t0 = 1251556	t=779708	speed=-2.11932659E9
t0 = 548115	t=838654	speed=3.44187878E9
t0 = 546159	t=752889	speed=4.837227E9
t0 = 531632	t=784457	speed=3.95530496E9
t0 = 523251	t=883911	speed=2.77269453E9
t0 = 546159	t=807924	speed=3.82022042E9
t0 = 543924	t=804013	speed=3.84483763E9
t0 = 545879	t=803454	speed=3.88236442E9
t0 = 543365	t=804013	speed=3.83659187E9
t0 = 546159	t=804012	speed=3.87817856E9

My 2.5GHz Intel Core (duo) can do about 3.88 billion n++ operations per second in a single thread.

When removing n++ (making both loops the same), the result is:
t0 = 1259099	t=602311	speed=-1.52256128E9
t0 = 544762	t=556216	speed=8.730574E10
t0 = 551467	t=540851	speed=-9.4197441E10
t0 = 666007	t=552864	speed=-8.8383724E9
t0 = 545041	t=553982	speed=1.11844311E11
t0 = 545600	t=556496	speed=9.1776795E10
t0 = 546159	t=603708	speed=1.73764956E10
t0 = 561245	t=540571	speed=-4.8369934E10
t0 = 553701	t=542527	speed=-8.9493463E10
t0 = 551467	t=607619	speed=1.78088038E10

These numbers are ridiculously high (roughly 30 times the n++ rate), meaning the overhead compensation is fairly reliable relative to the measurement noise. (Increasing samples increases accuracy.)

Now on to atan2.

	public static void main(String[] args)
	{
		Random RND = new Random();

		int samples = 1000000;
		long l1;
		long l2;
		long l3;
		float x;
		float y;

		for (int tries = 0; tries < 10; tries++)
		{
			// baseline: random number generation only
			l1 = System.nanoTime();
			for (int i = 0; i < samples; i++)
			{
				x = (RND.nextFloat() - 0.5f) * 200;
				y = (RND.nextFloat() - 0.5f) * 200;
			}
			l1 = System.nanoTime() - l1;

			// randoms + FastMath.atan2
			l2 = System.nanoTime();
			for (int i = 0; i < samples; i++)
			{
				x = (RND.nextFloat() - 0.5f) * 200;
				y = (RND.nextFloat() - 0.5f) * 200;
				FastMath.atan2(x, y);
			}
			l2 = System.nanoTime() - l2;

			// randoms + StrictMath.atan2
			l3 = System.nanoTime();
			for (int i = 0; i < samples; i++)
			{
				x = (RND.nextFloat() - 0.5f) * 200;
				y = (RND.nextFloat() - 0.5f) * 200;
				StrictMath.atan2(x, y);
			}
			l3 = System.nanoTime() - l3;

			float fast = samples * 1e9f / (l2 - l1);
			float slow = samples * 1e9f / (l3 - l1);
			System.out.println("speed fast=" + fast + "\tspeed slow=" + slow + "\tfactor=" + fast / slow);
		}
	}

Result:
speed fast=2.5838188E7	speed slow=3102130.2	factor=8.329176
speed fast=4.2304908E7	speed slow=6922386.0	factor=6.111319
speed fast=3.6428208E7	speed slow=6051220.5	factor=6.019977
speed fast=3.3007322E7	speed slow=5712438.0	factor=5.7781496
speed fast=3.6745696E7	speed slow=6234034.0	factor=5.8943686
speed fast=4.6704136E7	speed slow=6152503.5	factor=7.5910783
speed fast=3.9707864E7	speed slow=6488173.5	factor=6.1200376
speed fast=5.0430332E7	speed slow=6682681.5	factor=7.5464215
speed fast=4.770246E7	speed slow=6730517.0	factor=7.087488
speed fast=4.857506E7	speed slow=6979997.5	factor=6.9591804

I’ve also invented another, even more accurate but more limited, microbenchmarking technique: call the function once per iteration in one loop and 10 times per iteration in the other, then solve for the run time of a single execution. This is not applicable to atan2, because you can’t call it multiple times in a row and expect the performance to stay flat (and calling Random in between makes the technique useless).
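A minimal sketch of that technique, assuming a hypothetical someFunction(int) that has a side effect (so it can’t be eliminated) and a flat cost across back-to-back calls:

static int sink = 0;

// hypothetical function under test; the side effect keeps it from being eliminated
static void someFunction(int i) { sink += i * i; }

static double nsPerCall(int samples)
{
    long t1 = System.nanoTime();
    for (int i = 0; i < samples; i++)
    {
        someFunction(i);
    }
    t1 = System.nanoTime() - t1;

    long t10 = System.nanoTime();
    for (int i = 0; i < samples; i++)
    {
        someFunction(i); someFunction(i); someFunction(i); someFunction(i); someFunction(i);
        someFunction(i); someFunction(i); someFunction(i); someFunction(i); someFunction(i);
    }
    t10 = System.nanoTime() - t10;

    // the loop overhead is identical in both runs and cancels in the subtraction,
    // leaving only the cost of the 9 extra calls per iteration
    return (t10 - t1) / (9.0 * samples);
}

The per-iteration overhead cancels in the subtraction, so only the 9 extra calls remain; that is also why the technique only works when consecutive calls really do cost the same.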

Measuring such loops does not mean much, because as dead code they are candidates for elimination.
Compare runs with the client and server VMs.
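One cheap fix is to fold every result into an accumulator and print it afterwards; a sketch of the FastMath.atan2 loop amended that way (assuming FastMath.atan2 returns a float, as in the thread’s implementation):

float sum = 0;
long t = System.nanoTime();
for (int i = 0; i < samples; i++)
{
    float x = (RND.nextFloat() - 0.5f) * 200;
    float y = (RND.nextFloat() - 0.5f) * 200;
    sum += FastMath.atan2(x, y);  // the result is consumed, so the call can't be dropped
}
t = System.nanoTime() - t;
System.out.println(sum);          // printing the accumulator defeats whole-loop elimination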

The only situation where I’d use large tables for an approximation is if the target is interpreted; data flow is too big an issue on modern architectures. sproingie’s first test is the most reasonable so far, and it still shows the “fast” version in too good a light under reasonable usage. It succeeds in making the branch predictors “stub their toes” and the higher-level caches refill, BUT the lower-level caches will very quickly hold all the data. The second test is a step backwards. Note that if you think too deeply about this stuff it will really make your head hurt, because (for instance) in compiled versions under light temporal access patterns the version that needs to load the least data (including code) will probably be the winner… and you also have to note that loading data into core A can have an impact on the performance of core B, which might be wanting to access the memory architecture as well.

If you need to reseed, then either the test is broken or Random is. The only reason to reseed is to keep people from complaining about the test not being “fair”… and a couple of lines of code make it worth not having to explain some statistic.

So what you’re doing is causing the hardware version to be drastically slowed down by introducing memory stalls (that code chunk has virtually nothing to do to hide the stall). The impact on the “fast” version will be less marked: out-of-order execution, hiding of stalls, etc. This seems to me to be a highly unlikely situation: input from linear arrays of random data. Even if the input did look somewhat like this, it seems like there would be other work happening at the same time (again, the same points: out-of-order execution, hiding stalls, etc.).

Yes, dead code is useless, but don’t forget legal transforms as well. This bias computation could be off:


        l1 = System.nanoTime();
        for (int i = 0; i < samples; i++)
        {
            x = (RND.nextFloat() - 0.5f) * 200;
            y = (RND.nextFloat() - 0.5f) * 200;
        }
        l1 = System.nanoTime() - l1;

can be legally transformed into:


        l1 = System.nanoTime();
        for (int i = 0; i < samples; i++)
        {
            RND.nextFloat();
            RND.nextFloat();
        }
        l1 = System.nanoTime() - l1;

WRT StrictMath: this class is bordering on useless. At least I can’t think of a valid place to use it.

For microbenchmarking you might want to look at http://code.google.com/p/caliper/

The reason I would need to re-seed is to get the same random data, so that the “sum” is meaningful beyond keeping the loop from being optimized out, i.e. so I can compare the accuracy of the two methods. Otherwise, I’d hope that 10 million randoms are distributed well enough.

As for the second, transforming linear arrays of numeric data is the bread and butter of visual algorithms. If you’re computing thousands of normals, it’s going to have pretty similar access patterns (different trig functions of course). It’s rarely going to be as tight as a microbenchmark, but no one ever called microbenchmarks the last word.

As for StrictMath, it’s being used to initialize the table to give it predictable behavior across platforms, that’s all. That seems the most appropriate place to use it, since you’re only paying its cost once up front.
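For reference, that pattern looks roughly like this (the table size and the index-to-input mapping are placeholders, not the actual layout from the thread’s code):

// one-time table setup; StrictMath gives bit-identical results on every
// platform, and its cost is paid only once at class load
private static final int SIZE = 1024;              // hypothetical table size
private static final float[] ATAN_TABLE = new float[SIZE];

static
{
    for (int i = 0; i < SIZE; i++)
    {
        double ratio = (double) i / (SIZE - 1);    // hypothetical mapping
        ATAN_TABLE[i] = (float) StrictMath.atan(ratio);
    }
}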

duuuuh, so is the new atan2 now faster or not? =D

But the sum is useless for testing accuracy. The only thing it can tell you is whether your approximation is really broken, and nothing else.

In which case there is virtually always other exploitable information.

Bit-exact results for floats is crazy and was never the intent. Remember that in this case we’re talking about singles, so the results will be bit exact regardless.

If it summed absolute values, then it would be a reasonable indicator of cumulative error compared to the reference implementation. It’s not the most useful thing in the world, but I wouldn’t call it completely useless. You need something to keep the loop from being optimized out anyway, so it may as well have some value.

Yeesh, it’s not even my algorithm, don’t ask me to die on the hill of defending a throwaway microbenchmark. :slight_smile:

Hehe. Remember that when I’m responding, I’m not only addressing the person I’m replying to. Microbenchmarks are hard to do right, doubly so when attempting to compare a software implementation vs. hardware. WRT error analysis: this just isn’t the way to draw any conclusions. A sum of absolute errors is significantly better, but still not very meaningful. Like in Cas’s rant about the overuse of unit tests, doing a sum of errors is like attempting to show correctness with a bunch of black-box tests when some small number of white-box tests will tell you the exact information. If you don’t know how to do that, then for singles you can brute-force it by walking all values in the range using ulp or nextAfter.

For folks that don’t get why a sum of errors is useless, consider:

f(x) = if (x<0) return 0 else return 1;
a(x) = return 1/2;

or

f(x) = cos(x)
g(x) = sin(x)
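(In both pairs the signed errors cancel when summed over a symmetric range or a full period, so the total looks fine even though the “approximation” is nowhere close.) To make the brute-force suggestion concrete: for singles you can walk every representable float in the range with Math.nextUp and track the worst-case error. A sketch, where fastSin is a placeholder for whatever approximation is under test:

// exhaustively test every float in [lo, hi] against the reference,
// recording the worst error in ulps of the reference result
// (watch out for exact == 0, where the ulp is tiny)
static float maxUlpError(float lo, float hi)
{
    float worst = 0;
    for (float x = lo; x <= hi; x = Math.nextUp(x))
    {
        float approx = fastSin(x);              // placeholder for the approximation
        float exact  = (float) Math.sin(x);     // reference, rounded to a single
        float err    = Math.abs(approx - exact) / Math.ulp(exact);
        if (err > worst) worst = err;
    }
    return worst;
}

Because the walk is exhaustive, this reports the exact worst case over the range rather than a statistic.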

Sin and cos are very fast between -PI/4 and +PI/4 because modern CPUs have sin and cos instructions built in, and the JVM uses them. Outside that range, however, the CPU instructions aren’t always perfectly accurate, so the JVM calculates sin and cos in software instead. That gives us a very easy way to write fastsin and fastcos methods that are much better than any lookup table.

public static double sin(double x) {
    while(x > PI) x -= 2*PI;   // reduce the argument into [-PI, PI]
    while(x < -PI) x += 2*PI;
    if(x >= -PI/4 && x <= PI/4) return Math.sin(x);
    else if(x > PI/4 && x < 3*PI/4) return Math.cos(x-PI/2);     // sin(x) = cos(x - PI/2)
    else if(x >= 3*PI/4 && x <= PI) return Math.sin(PI-x);       // sin(x) = sin(PI - x)
    else if(x < -PI/4 && x > -3*PI/4) return -Math.cos(x+PI/2);  // sin(x) = -cos(x + PI/2)
    else if(x <= -3*PI/4 && x >= -PI) return -Math.sin(PI+x);    // sin(x) = -sin(PI + x)
    else return Math.sin(x);
}

At the risk of sounding like a two-year-old: “no!” I know of no architecture, whether sin/cos is software- or hardware-based, where this wouldn’t be much slower.

It will be slower, but more accurate. It’s true that the x86 FP code is only accurate to within PI/8 radians, and beyond that it tends to drift. Or at least, so I read, some years ago. So if you want fast and accurate, you’ve got to tailor your input by design.

Cas :slight_smile:

Did you test that theory?
Here is my benchmark.

import static java.lang.Math.PI;

public class Trig {
    public static void main(String[] args) {
        long start = System.nanoTime();
        double test = 0; //to prevent the JVM from
                         //optimising the loop away
        for(int i=0; i<100000000; i++) {
            test += sin(3.15);
        }
        long end = System.nanoTime();
        System.out.println("Time = "+((end-start)/1e6)+"ms");
        System.out.println(test);
        start = System.nanoTime();
        test = 0;
        for(int i=0; i<100000000; i++) {
            test += Math.sin(3.15);
        }
        end = System.nanoTime();
        System.out.println("Time = "+((end-start)/1e6)+"ms");
        System.out.println(test);
    }
    
    public static double sin(double x) {
        while(x > PI) x -= 2*PI;
        while(x < -PI) x += 2*PI;
        if(x >= -PI/4 && x <= PI/4) return Math.sin(x);
        else if(x > PI/4 && x < 3*PI/4) return Math.cos(x-PI/2);
        else if(x >= 3*PI/4 && x <= PI) return Math.sin(PI-x);
        else if(x < -PI/4 && x > -3*PI/4) return -Math.cos(x+PI/2);
        else if(x <= -3*PI/4 && x >= -PI) return -Math.sin(PI+x);
        else return Math.sin(x);
    }
}

The output (using Java 1.7.0_03, 64-bit):

Time = 81.661547ms
-840724.7369070747
Time = 2811.498313ms
-840724.7369070747

I’m seeing the performance difference in that range too. But if it were that simple, the built-in sin/cos would do the same thing you do (ignoring numerical accuracy issues). Of course it’s not that simple: your code has almost half the performance with randomized input, which is the general case that the JDK methods are optimized for.

Here’s a minimum set of rules that micro-benchmarks need to obey in order to be useful (a sketch following them appears after the list):

  • Do something useful with every loop result, to avoid no-op code that gets optimized away.
  • Use randomized input, perhaps biased a bit towards the most frequent input range.
  • Don’t use Random in the benchmarking loop, generate input ahead of time and store it in an array. Preferably the array should easily fit in the CPU caches, use % addressing if necessary for long loops.
  • Extract the loop code into its own method and call it enough times to warm it up before timing anything. OSR can be sub-optimal.
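A skeleton that follows all four rules might look like this; the work() kernel, the array size, and the iteration counts are placeholders, not recommendations:

import java.util.Random;

public class Bench
{
    // hypothetical kernel under test
    static float work(float x, float y) { return (float) Math.atan2(x, y); }

    // rule 4: the timed loop lives in its own method so it gets fully JIT-compiled
    static float run(float[] in, int iterations)
    {
        float sum = 0;                                // rule 1: consume every result
        for (int i = 0; i < iterations; i++)
        {
            int j = (2 * i) % in.length;              // rule 3: % addressing for long loops
            sum += work(in[j], in[j + 1]);            // j is even, so j + 1 stays in bounds
        }
        return sum;
    }

    public static void main(String[] args)
    {
        Random rnd = new Random();
        float[] in = new float[8192];                 // rule 3: fits easily in the caches
        for (int i = 0; i < in.length; i++)
            in[i] = (rnd.nextFloat() - 0.5f) * 200;   // rule 2: randomized input, generated up front

        float sink = 0;
        for (int warmup = 0; warmup < 20; warmup++)   // rule 4: warm up before timing
            sink += run(in, 1000000);

        long t = System.nanoTime();
        sink += run(in, 10000000);
        t = System.nanoTime() - t;
        System.out.println(sink + "\t" + (t / 1e6) + " ms");
    }
}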

Although the benchmark wasn’t properly formulated, there’s still something rotten in Denmark. I’m away from home for another week and will have to look at this once I’m back. The gap in the numbers is way too large. If HotSpot is using x87, and x87 sin/cos have changed to spitting out microcode to reduce die footprint (bad me for not keeping up, if that’s the case), then an SSE argument reduction should be used. Actually, I rather hope that sin/cos are implemented in SSE, as that allows for scheduling with the surrounding code. In either case, argument reduction shouldn’t have this large an impact on execution, even assuming that HotSpot has range analysis and the fast version is getting optimized into a sin/cos(x + CONST) call.

The while loops could be replaced with a single statement involving two modulos. Branching is way too expensive here.
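Concretely, the two while loops in the posted sin() could be collapsed into one branch-free statement (a sketch; floating-point rounding right at the boundaries aside):

// reduce any x into [-PI, PI) without data-dependent branches:
// Java's % keeps the sign of the dividend, so shift into a positive range
// before the second modulo
x = (x % (2 * PI) + 3 * PI) % (2 * PI) - PI;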

A side note: at least on my implementation, Sun/Oracle’s Java for Windows, all the Math.xxx methods in rt.jar (the part of the JVM that is written in pure Java) delegate to StrictMath.xxx. So calling StrictMath.xxx directly saves one method call.
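For illustration, in the rt.jar of that era the pure-Java Math methods were essentially one-line wrappers of this shape (paraphrased, not the exact source):

// java.lang.Math (paraphrased): the method simply forwards to StrictMath
public static double atan2(double y, double x)
{
    return StrictMath.atan2(y, x);
}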