Amazing java.lang.Math performance

Hi all,
In researching our Vector/Matrix classes and our chapter on fast Java math, I have written some java.lang.Math alternatives using typical techniques from the C/C++ world (look-up tables, approximations, even a ported Quake3 inverse square root routine ;-)).
However, when attempting to benchmark, the JIT/VM is somehow doing AMAZING things with java.lang.Math! Like java.lang.Math gets done in less time than actually doing nothing…

Or here’s one: StrictMath produces the expected execution times, where the approximation routines are faster, but Math is WAY faster. However, if you decompile java.lang.Math, it CALLS StrictMath, which then calls native code.

In addition, if I test using the -Xcomp option, I get more reasonable results again.
Anyone have any ideas? Has anyone written a microbenchmark with a more realistic execution profile? I tried several techniques to prevent unrealistically aggressive inlining, etc., such as using random numbers and feedback loops.
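For what it’s worth, here is a minimal sketch of the kind of microbenchmark I mean (class and variable names are mine, not from any real harness, and it uses the modern System.nanoTime; on 1.4 you would fall back to currentTimeMillis). Random inputs stop the JIT from constant-folding the call, and printing the accumulated sink stops it from dead-code-eliminating the loop.

```java
import java.util.Random;

public class MathBench {
    public static void main(String[] args) {
        Random rnd = new Random(42);
        double[] input = new double[1000000];
        for (int i = 0; i < input.length; i++) {
            input[i] = rnd.nextDouble() * 1000.0;
        }

        double sink = 0.0;
        // warm-up pass so the JIT has compiled the loop before we time it
        for (int i = 0; i < input.length; i++) {
            sink += Math.sqrt(input[i]);
        }

        long t0 = System.nanoTime();
        for (int i = 0; i < input.length; i++) {
            sink += Math.sqrt(input[i]);
        }
        long t1 = System.nanoTime();

        // printing the sink keeps the whole computation live
        System.out.println("Math.sqrt x 1e6: " + ((t1 - t0) / 1e6)
                + " ms (sink=" + sink + ")");
    }
}
```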

Thanks for any help!

Wow! A pleasant surprise indeed…

Have you read this article, BTW:

http://www.javaworld.com/javatips/jw-javatip141_p.html

That article’s finding is that the J2SE 1.4 math routines are
about 3 times slower than JNI-invoked C routines, which are
themselves about 50% slower (due to JNI overheads) than
J2SE 1.3.

Anyway, no complaints with your results if the JVM appears to
be a lot smarter under the hood :)

Might be worth trying it out with the -server option too…

Earlier this year, our plan was to make a JNI vector/matrix lib, but once 1.4.2 came out we got free SSE2 (SIMD) optimization where supported. This is a major cool feature for the VM. For us to make a vector lib that used SIMD (which was the original plan), we would have had to get the Intel SIMD compiler (not free), and then that lib would also not be compatible anywhere else. As nasty as that was, we were willing to go that route, at least as research.

Once 1.4.2 arrived with SIMD support and we ran some tests, we were blown away. It really did a good job of improving over the previous VM for matrix operations… However, we did not test against pure C code, which perhaps we should do now. We were quite happy with the initial improvement, and had another great case for Java’s JIT-compiling VM. So we dropped the JNI path; it seemed unnecessary now that the VM was really tight mathematically.
Auto SIMD for legacy vector code? Fantastic!
That only leaves optimizing sin/cos/sqrt, the typical gaming approximation palette.

However, we can’t seem to get a realistic benchmark on the main java.lang.Math methods; it seems like the VM is doing some seriously whacky stuff with them.
Math calls StrictMath, which calls native code.
BUT, in test after test, Math runs FASTER than StrictMath (even though it calls it?) and faster than any approximation we write, even methods that just return static values!!! Now explain that one, VM priests! ;) It’s like the VM KNOWS about java.lang.Math… (which I’m quite sure it does now)

You should not interpret java.lang.Math’s source literally. While in source it looks like it calls StrictMath, that is probably only a fallback for weaker/experimental JITs. AFAIK, HotSpot has exact knowledge of the Math class and simply ignores the implementation provided in rt.jar, instead generating optimized inlined code itself. The same thing probably happens for String.charAt, maybe String.length, and probably System.arraycopy (which can be implemented as a single memmove in cases where everything about the arrays is known statically).
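As a trivial illustration of the arraycopy point: the call below looks like an ordinary native method in the source, but the VM can treat it specially and compile it down to a bulk memory move when the element type and bounds are known.

```java
import java.util.Arrays;

public class CopyDemo {
    public static void main(String[] args) {
        int[] src = {1, 2, 3, 4, 5};
        int[] dst = new int[5];
        // one call copies the whole range; no per-element loop in the bytecode
        System.arraycopy(src, 0, dst, 0, src.length);
        System.out.println(Arrays.toString(dst)); // prints [1, 2, 3, 4, 5]
    }
}
```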

'srite.

We dumped our vecmath stuff from LWJGL as well once we realised just how fast 1.4.2 server had become, math-wise.

Cas :)

[quote]AFAIK, Hotspot has exact knowledge of Math class and just plainly ignores implementation provided in rt.jar, instead generating optimized inlined code itself.
[/quote]
It is cool that HotSpot can do this; since Math is final, it really can be just a token to the compiler. But can you point me to any papers/articles that discuss HotSpot having explicit knowledge of Math?

I noticed that both Math and StrictMath are declared with the strictfp keyword… something I have to go look up, because I keep forgetting exactly what it does. I expected to see strictfp only on StrictMath.

It was mentioned in one of the Sun Java chat sessions that 1.4.2 does not use any SIMD instructions when it makes use of SSE2. I don’t know enough about SSE2 to really know what that implies.

Yes, please post a link to a paper or presentation. I really need that reference before I can commit to this info (and thanks for it as well!).
I contacted Sun directly as well, but I am still waiting to hear back from them.
Thanks!

[quote]AFAIK, Hotspot has exact knowledge of Math class and just plainly ignores implementation provided in rt.jar, instead generating optimized inlined code itself. Same thing happens probably for String.charAt, maybe String.length and probably System.arraycopy (which can be implemented as single memmov in case where everything about arrays is known statically).
[/quote]
It is true for arraycopy(). Although I can’t find any “official” word on it, this interview mentions it: “You ALWAYS want to use arraycopy. It’s really fast. The call is native, but doesn’t go through JNI. It is a special hook built into the VM.” I wouldn’t be surprised if Math gets the same treatment.

Math definitely gets the same treatment. Math calls are converted directly into x86 FPU instructions like fsin, etc. However, StrictMath calls are not, because the Java platform specification requires StrictMath to follow the exact published algorithms (fdlibm), which the x86 FPU doesn’t implement. (I believe the x86 FPU computes at higher precision.) The point of StrictMath is guaranteed reproducibility across VMs.
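A quick way to see the spec difference (this snippet is mine, not from the thread): Math.sqrt is required to be correctly rounded, so it must agree bit-for-bit with StrictMath.sqrt, while Math.sin is only required to be within 1 ulp of the fdlibm result, so those two MAY differ from one VM to another.

```java
public class StrictCheck {
    public static void main(String[] args) {
        double x = 1234.5678;
        // sqrt is correctly rounded in both classes, so these always match
        System.out.println(Math.sqrt(x) == StrictMath.sqrt(x)); // prints true
        // Math.sin is allowed 1 ulp of slack, so this result is VM-dependent
        System.out.println(Math.sin(x) == StrictMath.sin(x));
    }
}
```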

The server VM operates very differently to the client VM with regard to maths. I remember running my terrain demo once and getting 10fps and being confused; then I discovered it was for some reason using StrictMath. I forget how and why now.

Cas :)

I cannot find an official word on it at the moment. I don’t remember if I heard it somewhere, or just talked about this kind of optimization on one of the JVM forums.

Anyway, as half-proof, check jvm.dll and look for java/lang/Math. Around it you will see a few interesting strings, like sqrt, atan2, tan, cos. Before that you can see arraycopy, identityHashCode, currentThread, getClass, the Unsafe class, etc. Looks very suspicious: exactly like a listing of the static/final methods that the JVM can implement far more efficiently than through any JNI call…

As for the String class, I do not see any of its methods there. Maybe they were there in some past versions of HotSpot, but since then the default JIT/optimizer got good enough to emit perfect versions of them anyway… or maybe I just imagined it.

As for ‘committing to it’… I think it is a valid optimization, so even if it is not in the current HotSpot, it can get there in future. And as far as the current version is concerned, the benchmark is your friend :)

Thanks for everyone’s comments so far; they have helped a bit, along with cross-referencing a few outside sources and the JavaOne papers/presentations.

I will post our new math utils as soon as I have one that I believe is WORTH using…

Every attempt at a faster square root or inverse square root, float or double, has failed. The functions are correct (though approximate) and speedy. However, the fastest techniques use bit-twiddling of the floating-point format, which I have implemented; unfortunately, in Java this requires method calls, the float(double)-to-int(long) conversion calls, where in C this is a cast. Now those calls surely do little more than a cast and a copy, but it is enough to make the gains not pan out. Even the Quake3 inverse sqrt (used for vector normalizing, among other things) is still slower than 1.4.2’s Math.sqrt, I suspect simply because of the method-call cost in Java.

I did manage to get some gains on sin and cos, using fast functions and look-up tables. With my sin/cos look-up functions I get about a 3.15x speedup when working in degrees, because the table is degree-indexed and the java.lang.Math tests had a toRadians call in there (this is somewhat fair, since many times we work in degrees and have to convert at run-time). In the no-toRadians version, i.e. a straight Math.sin(angle) vs. FastMath.sin(angle) test, I get a > 60% speedup. However, this is on VERY focused tests. When placed in a more realistic test such as Matrix.rot(x), we get around a 2.05x speedup. Anyway, when I have a nice comparison chart I’ll post the results with the source.
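The degree-indexed look-up approach described above can be sketched roughly like this (class and method names are mine, not the author’s actual FastMath; the table is exact at whole degrees, to float precision, and cos reuses the sin table via a 90-degree phase shift):

```java
public final class SinTable {
    private static final float[] SIN = new float[360];
    static {
        // fill the table once, using the slow-but-exact library call
        for (int deg = 0; deg < 360; deg++) {
            SIN[deg] = (float) Math.sin(Math.toRadians(deg));
        }
    }

    /** Sine for a whole-degree angle; exact at integer degrees. */
    public static float sinDeg(int degrees) {
        int d = degrees % 360;
        if (d < 0) d += 360;      // normalize negative angles into [0, 360)
        return SIN[d];
    }

    /** Cosine via the 90-degree phase shift into the same table. */
    public static float cosDeg(int degrees) {
        return sinDeg(degrees + 90);
    }

    public static void main(String[] args) {
        System.out.println(sinDeg(30)); // prints 0.5
    }
}
```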

On the upside, against StrictMath this stuff rocks: up to 10 times faster in some cases. So if you are using an older VM, or are in a different execution mode such as -Xcomp or interpreted, you get great gains.

Of course, the really good news is that Java math is now super-fast in the Java world. I imagine that java.lang.Math will be the fastest possible way to do floating-point work that isn’t batched and sent to the native side, and batching is really only practical in very specialized cases. I have a few; anyone have any others? I would like to compile a list of these specialized cases and try to find the break-even sizes.

For anyone who may need/want it, or wishes to test.

// Magic numbers ;-) used to seed the square-root iteration.
static final float Af = 0.417319242f;
static final float Bf = 0.590178532f;

public static final float fasterSqrtF(float fp)
{
    if (fp <= 0)
        return 0;

    int bitValue = Float.floatToRawIntBits(fp);

    // pull out the unbiased exponent
    int expo = (bitValue >> 23) - 127;

    // reduce to the mantissa range: even exponents map to [1,2), odd
    // exponents to [2,4), so that halving the exponent below is exact
    bitValue &= 0x7fffff;   // clear sign/exponent bits, keep the mantissa
    bitValue += ((expo & 1) == 0 ? 127 : 128) << 23;
    fp = Float.intBitsToFloat(bitValue);

    // linear first guess, then Newton-Raphson iterations for the mantissa
    float root = Af + Bf * fp;
    root = 0.5f * (fp / root + root);
    root = 0.5f * (fp / root + root);
    root = 0.5f * (fp / root + root);
    // uncomment for more accuracy (unrolled rather than looped, for speed)
    //root = 0.5f * (fp / root + root);

    // halve the exponent (arithmetic shift = floor, correct for negatives too)
    expo >>= 1;

    // reassemble: root is in [1,2), so its exponent field is exactly 127;
    // replace it with the halved, re-biased exponent
    bitValue = Float.floatToRawIntBits(root);
    bitValue &= 0x7fffff;
    bitValue += (expo + 127) << 23;

    return Float.intBitsToFloat(bitValue);
}

// converted from Quake3 code (supposedly)
public static final float fastInverseSqrt(float x)
{
    float xhalf = 0.5f * x;
    int bitValue = Float.floatToRawIntBits(x);
    bitValue = 0x5f3759df - (bitValue >> 1);    // the famous magic constant
    x = Float.intBitsToFloat(bitValue);
    x = x * (1.5f - xhalf * x * x);   // Newton-Raphson step 1
    x = x * (1.5f - xhalf * x * x);   // Newton-Raphson step 2
    // uncomment for more accuracy (unrolled rather than looped, for speed)
    //x = x * (1.5f - xhalf * x * x);
    return x;
}
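For anyone who wants to sanity-check the inverse square root before trusting it, here is a small accuracy harness (the harness class is mine; it copies the Quake3-style method from the post above so it stands alone). With two Newton-Raphson steps the result is typically within a few parts per million of 1/Math.sqrt(x).

```java
public class SqrtCheck {
    public static void main(String[] args) {
        // sample a spread of magnitudes and report the relative error
        for (float x = 0.5f; x < 1000f; x *= 3.7f) {
            float approx = fastInverseSqrt(x);
            double exact = 1.0 / Math.sqrt(x);
            double relErr = Math.abs(approx - exact) / exact;
            System.out.println("x=" + x + "  relErr=" + relErr);
        }
    }

    // Quake3-style inverse square root, as posted above
    static float fastInverseSqrt(float x) {
        float xhalf = 0.5f * x;
        int bitValue = Float.floatToRawIntBits(x);
        bitValue = 0x5f3759df - (bitValue >> 1);
        x = Float.intBitsToFloat(bitValue);
        x = x * (1.5f - xhalf * x * x);   // Newton-Raphson step 1
        x = x * (1.5f - xhalf * x * x);   // Newton-Raphson step 2
        return x;
    }
}
```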


[mod]Damn, kinda ate my tabs and spaces, oh well

Yeah, the last time fast inverse square-root appeared on these boards it was beating the java.lang.Math approach in some situations but not in others. I believe it depends on your hardware and Java runtime version. I’ll try and find the post…

EDIT: Here you go: http://www.java-gaming.org/cgi-bin/JGNetForums/YaBB.cgi?board=Tuning;action=display;num=1046286297

Wow, thanks much on that forum link.
I completely missed that discussion.

Hey Shawn, we have been looking at the various copyrights of the third-party libraries which Xith3D depends on. The only one of worry right now is vecmath. Will you be releasing your library with a BSD license?

I hope so :)

I think right now we can release under any license we choose, so I was hoping to “give” it away as best we can.

When I got about thigh-deep in this, it occurred to me that instead of writing a new vec/mat API, I should be writing about how to properly use the Java Game Technologies vec/mat API, meaning that there should be an official one. :)

However, even with the proper design rationale documented, I do not think that there is a one-size-fits-all solution, even with source provided. If VecMath were a separate release AND open source, I might have just released some minor additions/changes, for example adding Matrix.rotXFast()-type stuff. But I still have a bunch of utils that are nice for many things but don’t belong in the core classes.

It is clear to me that any vec/mat API should be a core API AND utils, because there are many things you will want to do with it where you don’t care so much about optimization but DO care about correctness. For example, in any tools or even in-game loaders I want the BEST math and the easiest utils, but at run-time I can go the faster, less accurate route.

To answer more directly: I would absolutely like to release it for testing out in Xith3D. Finally get a chance to help out :)

However, it’s not completely compatible with J3D VecMath. Currently there are no double forms, although that’s easily enough done. A lot of the set convenience methods aren’t there either (all the setElements, etc.).

I’ll post the call list (maybe the whole thing) as soon as I have a beta, which must be finished within two weeks anyway :)