Fast Trig on the x86

Hey guys,

As soem of you may or may not know there is an issue in java where trig calculatiosn over 45 degrees or under -45 degrees ar esignificantly slower in Java then C on the x86 processor. This is due to the fact that the x86 prcoessor is broken and its trig intrinsics fail Java’s accuracy gaurantees for anything otuside of that range, forcing Java to fall back on sofwtare calculation of those functions.

I’ve just posted my first gem on my gems page, which is tested code for Sin and Cos which will take an arbitrary Sin or Cos calculation and reduce it to one in that range, ensuring speedy calculation.

Enjoy.

http://wiki.java.net/bin/view/Games/JeffGems

It would be nice if you posted some benchmarks too, against trig tables and Math.sin/cos/tan

Trig tables will beat the … out of sin/cos. (somewhere around factor 7 IIRC)

If your lookup-table is of length 100, you do:
entry = (( (int)(angle * scale) % 100) + 100) % 100;
…to catch all angles, even negative ones, without if-statements

Will post the benchmarks when I find time… at work now :-X

I just made a very simple benchmarks of those functions (loop 2000000 ops ):

		long startTime=0;
		long endTime=0;
		int time=0;
		int nbOperation=2000000;	
		double result=0.0;
		double a=3*Math.PI;
		
		startTime=System.currentTimeMillis();
		for(int n=0;n<nbOperation;n++)
		{
			result+=Math.cos(a);
			a+=10;
		}
		endTime=System.currentTimeMillis();
		time=(int)(endTime-startTime);
		System.out.println("Time=" + time);
		System.out.println("Op/s=" + (nbOperation*1000)/time);
		System.out.println("result=" + result);
		
		result=0.0;
		a=3*Math.PI;
		startTime=System.currentTimeMillis();
		for(int n=0;n<nbOperation;n++)
		{
			result+=FastTrig.FastCos(a);
			a+=10;
		}
		endTime=System.currentTimeMillis();
		time=(int)(endTime-startTime);
		System.out.println("Time=" + time);
		System.out.println("Op/s=" + (nbOperation*1000)/time);
		System.out.println("result=" + result);	

FastTrig.FastCos is about one(0.9) to ten times faster than Math.cos !! of course it depend on tests ranges and especially the parameter “a”

final result is not the same but after 2000000 cosinus additions there is just a very very little difference.

Bruno

Well… I’d expct it to be slightly slower then Math.cos for anythign in the -45/+45 degree range because there is no benefit and youa re adding a tiny handful of additional ops.

I’d be very curious to see a distribution of speed comparisons by “a”…

Okay, I wrote this little benchmark…


public class FastTrigBenchmark {
	
	 static double[][] benchmarkCos(double startvalue, double endvalue, double increment, long samples){
	    	double[][] results = new double[(int)Math.ceil((endvalue-startvalue+1)/increment)][2];
	    	int datapt = 0;
	    	for(double testval=startvalue;testval<=endvalue;testval+=increment){
	    		long time = System.nanoTime();
	    		for(int i=0;i<samples;i++){
	    			Math.cos(testval);
	    		}
	    		results[datapt][0] = time - System.nanoTime();
	    		time = System.nanoTime();
	    		for(int i=0;i<samples;i++){
	    			FastTrig.fastCos(testval);
	    		}
	    		results[datapt++][1] = time - System.nanoTime();
	    	}
	    	return results;
	    }

	/**
	 * @param args
	 */
	public static void main(String[] args) {
		double start = -Math.PI*10;
		double end = Math.PI*10;
		double incr = Math.PI/180.0;// check by degree
		//warmup
		System.out.println("Warming up");
		benchmarkCos(start,end,incr,1000); 
		//now do for real
		System.out.println("Collecting values");
		double[][] result = benchmarkCos(start,end,incr,1000); 
		// print raw output
		System.out.println("Raw Results");
		double val = start;
		for(int i=0;i<result.length;i++){
			System.out.print("PI*"+val/Math.PI+":");
			System.out.print("Cos:"+result[i][0]);
			System.out.println("FastCos:"+result[i][1]);
			val += incr;
		}
	
		System.out.println("Deltas as % (FastCos = X % Cos)");
		val = start;
		for(int i=0;i<result.length;i++){
			System.out.print("PI*"+val/Math.PI+":");			
			System.out.println((result[i][1]/result[i][0])*100+"%");
			val += incr;
		}

		System.out.println("total times as %: (FastCos = x % cos)");
		double cosAcc = 0;
		double fastCosAcc = 0;
		for(int i=0;i<result.length;i++){
			cosAcc += result[i][0];
			fastCosAcc += result[i][1];
		}
		System.out.println((fastCosAcc/cosAcc)*100+"%");

	}

}
[quote]
[/quote]
On the whole, voer the entire distribution, we're seeing about the same speed (it varies a bit for reasons Im unsure of)

What i find mreo inetresting though is the breakdown of the bands. It suggests there migth be some subtle tuning I can do yet...

Im trying to get Star office to graph it. if I can make that work ill post.

Intresting! i must admit im having a bit of trouble figuring out how your lookup works… just trouble picturing it in my head…

But it sure looks more efficient! Are you assuming a 0-360 degree table?

Edit:Oh ,I got it, most of that is jsut to movve the index out of the negs :slight_smile:
What is “scale” in the above?

[quote]Well… I’d expct it to be slightly slower
[/quote]
yes about 0.9

[quote]re adding a tiny handful of additional ops
[/quote]
additional ops are negligable with cos (eg: += can be performed about 60000000/s when cos is about 1000000), and you need a value that is updated in the loop and used outside to be sure that optimiser will not change benchmark result.

it seem that if you calculate random cos and sin fasttrig become faster (in the outside range) but if you calculate continuous cos like cos,0.1 cos,0.2cos etcc (a+=0.1) it become equals (a bit less 0.9 due to over op)

maybe a cache or something like that (serie cache)

Bruno

Good catch. Il have to rewrite a gaussain random benchmark at some point…

I use “Yakly degrees” myself - instead of needing to do % modulus ops I can just bitmask 'em instead - there are conveniently 0x10000 Yakly degrees in a circle :wink:

One thing that people using LUTs forget is that it can be a very, very expensive operation if you’re not lucky, as if your LUT finds itself outside the L1 cache, or worse, the L2 cache, you’re looking at a bus wait. Aggh!

Cacs :slight_smile:

Ummm… wassat? :wink:

If i understand it, its an arbitrary angle division that happens to fit into base 2 well.

Look at it this way: A circle is…

360 degrees
or
(approximately)6.28 Radians
or
0x10000 Yaklys

Get it? :slight_smile:

If you use table look-ups, you should linearize between look-up points to ensure that the curve is continuous. This should be relatively cheap.

FastTrig is stable 20-30% slower than conventional Math.cos()/sin() on my environment.
btw, I have used values between -10000 … 10000 rads for angle.

Conventional math: 0.577s
Fast math: 0.794s


/**
 * User: Andrei_Karneyenka
 */
public class TrigTest extends TestCase {


    public void testSinCos() throws Exception {

        double[]values = new double[2000];

        for (int i = 0; i < values.length; i++) {
            values[i] = 10000 - Math.random() * 20000;
        }

        TimerCounter timer = new TimerCounter();
        for (int i = 0; i < values.length; i++) {
            for (int j = 0; j < values.length; j++) {
                double value = values[j];
                Math.cos(value);
                Math.sin(value);
            }
        }
        System.err.println("Conventional math: "+timer.getElapsedTime());

        for (int i = 0; i < values.length; i++) {
            for (int j = 0; j < values.length; j++) {
                double value = values[j];
                Trig.cos(value);
                Trig.sin(value);
            }
        }
        System.err.println("Fast math: "+timer.getElapsedTime());

    }
}

This has already been covered. Linearly advancing benchmarks are a special case for Math.cos/sin and much faster then random data points.

I have a note on this at the end of the article.

Seperately I dont see any warm-up on your benchmark.

Benchmarks without warm-up are fairly useless on J2SE due to the impact of the JIT, both in time burned for compilation and the subsequent speed up.

Depending on exactly when the JIT chose to do compilation and on-stack repalcement your benchmark would produce wildly different results.

Writing good micro-benchmarks under J2SE is a rather difficult and specialized art requiring a lot of understanding of what the VM is doing underneath.

Let say you want 10 accurate values between an angle of 4 and 5 (thus: 4.0, 4.1, …, 4.9) (doesn’t matter whether those are degrees or radians)
For simplicity sake we do it in degrees here:
we have 10 * 360 (3600) entries in the lookup-table
we then do the following:

public static final float fastSin(float degrees) {
   return lookupTable[  (((int)(degrees * 10) % 3600)+3600)%3600  ];
}

… IIRC :slight_smile:

“we have 10 * 360 (5760) entries in the lookup-table”
Where did you get this 5760 entries in an array table if we have 0-360 degrees with 10 step accurate scale?

I appearantly like to change some values in my examples without changing everything else.

I started with 16 * 360… will change it.

Personally I just use lookup tables for acos (which seems horrendously slow) - sin and cos don’t cause me enough problems for me to need LUTs for them.