Who got what idea? I don’t see anything in this thread even remotely related to rainbow tables.
I didn’t know that. It isn’t flagged as native so I never expected the JVM would generate exceptions for those.
I did some quick tests, and indeed the trig functions are on the order of 80 times faster when calling Math. :cranky:
On the other hand, some functions like min/max are 35% faster when calling StrictMath (I haven’t tested others yet).
Edit: A more advanced test (using my second test type, and preventing the code from being optimized away) now shows that StrictMath is about 10% faster than Math. ???
Edit2: (StrictMath compared to Math is…)
sin,cos: 50% slower
tan: 10% faster
atan: about the same
sqrt: 9000% slower
cbrt: about the same
floor: about 10% slower
ceil: about the same
max,min: about the same
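For anyone who wants to try reproducing numbers like these, here is a crude timing sketch (class and method names are mine, not from the tests above; a real benchmark should use JMH, but this at least warms up the JIT and keeps results live so the loop isn't optimized away):

```java
public class TrigBench {
    static double sink; // keeps results live so the JIT can't eliminate the loop

    static long benchNanos(java.util.function.DoubleUnaryOperator f, int n) {
        double s = 0.0;
        long t0 = System.nanoTime();
        for (int i = 0; i < n; i++) {
            s += f.applyAsDouble(i * 0.001);
        }
        sink = s;
        return System.nanoTime() - t0;
    }

    public static void main(String[] args) {
        int n = 2_000_000;
        // Warm-up runs so HotSpot compiles both call paths before timing.
        benchNanos(Math::sin, n);
        benchNanos(StrictMath::sin, n);
        System.out.printf("Math.sin:       %d ms%n", benchNanos(Math::sin, n) / 1_000_000);
        System.out.printf("StrictMath.sin: %d ms%n", benchNanos(StrictMath::sin, n) / 1_000_000);
    }
}
```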
And I think we’re drifting too much off topic. Shouldn’t someone make a new topic for this?
Or we could stop repeating the same thing over and over…and simply use Riven’s methods for fast sin/cos/tan/atan (@Riven do you have fast implementations for asin and acos?)
It always bothers me a bit that Math.sqrt is such a slow operation. A quick Google search suggests there is indeed a fast integer square root using techniques similar to the ones above. It should make a nice addition to the collection of fast hacky maths methods.
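For what it’s worth, the usual integer square root via Newton’s iteration looks like this (a sketch, not a tuned version; whether it beats Math.sqrt plus a cast depends on the JVM and hardware):

```java
// Integer square root: the largest r such that r * r <= n.
// Newton's iteration on integers converges in a handful of steps.
static int isqrt(int n) {
    if (n < 0) throw new IllegalArgumentException("negative input");
    if (n < 2) return n;
    int x = n;
    int y = (x + 1) >>> 1; // unsigned shift avoids overflow on n + 1
    while (y < x) {
        x = y;
        y = (x + n / x) >>> 1;
    }
    return x;
}
```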
The issue is that the optimal approximation depends on a lot of factors, and most people want one version of each function that “just works”. That’s simply not the way to maximize performance. As I’ve mentioned before, I’ve never seen a table-based method that out-performs numeric methods (on non-embedded systems). But that’s “my” use-case, YMMV. If you really want fast, you have to know your input range, what kind of error matters (absolute or relative), the bound on that error, and potentially the access patterns.
WRT: trig & inverse trig. Remember what the man said: “The best optimization is to not need to call anything.” If you’re calling inverse trig functions, my question is: “do you really want the angle? I mean really, really want the angle?” For example, if you’re storing your objects’ orientations as angles, therein lies a problem. Another is converting to polar, manipulating the angle, then converting back.
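To illustrate the point about not storing orientations as angles, here is a hypothetical sketch: keep a unit direction vector and rotate it complex-number style, so the per-update path needs no trig calls at all (cos/sin of the fixed step are computed once up front):

```java
// Orientation stored as a unit vector (cos θ, sin θ) instead of the angle θ.
static final class Orientation {
    double x = 1.0, y = 0.0; // initially facing along +X

    // Rotate by a fixed step; cosStep/sinStep are precomputed once,
    // so the hot path is four multiplies and two adds, no trig.
    void rotate(double cosStep, double sinStep) {
        double nx = x * cosStep - y * sinStep;
        double ny = x * sinStep + y * cosStep;
        x = nx;
        y = ny;
    }
}
```

The angle is only needed if something external demands it, at which point a single atan2 at the boundary is usually cheap enough.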
WRT: sqrt. All roots are a pain because the terms of the Taylor series decrease very slowly, so they are very hard to approximate with a polynomial or a ratio of polynomials. If you almost know the answer, though, they become very easy (e.g. via Newton’s method).
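As a sketch of the “almost know the answer” case (helper name is mine): one Newton-Raphson step for 1/sqrt, which is plenty when renormalizing a vector whose length has only drifted slightly from 1:

```java
// One Newton-Raphson step toward y = 1/sqrt(a):
//   y' = y * (1.5 - 0.5 * a * y * y)
// Error shrinks quadratically, so a guess that's already close
// (e.g. a ≈ 1 after accumulated drift, guess = 1.0) converges in one step.
static double rsqrtStep(double a, double guess) {
    return guess * (1.5 - 0.5 * a * guess * guess);
}
```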
Fast/slow: it’s all relative. To put some numbers on it, here is instruction latency/throughput (both in cycles) for modern(ish) Intel; the ranges cover a reasonable set of hardware.
SQRTSD: 20-39/14-39 (one compliant double sqrt). On unspecified input ranges these can drop to ~6/~10.
SQRTSS: 23-32/23-32 (one compliant single sqrt). On unspecified input ranges these can drop to ~6/~10.
RSQRTSS: 3-6/1-4 (approximate 1/sqrt for singles. We can’t do this from Java…big fat sadface)
So what I intended to say about sqrt being “hard” is that, unlike many other functions, reducing the range or precision to get a faster version is much more difficult. About the only way you can succeed is if you know the input is very near a specific value (as when renormalizing). That old SGI trick (popularized by id) should always be slower than simply using the hardware.
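For reference, that trick translated to Java looks something like this (shown for comparison only; as said above, on modern hardware plain 1.0 / Math.sqrt(x) should beat it in both speed and accuracy):

```java
// Classic SGI/Quake-style approximate 1/sqrt(x) via bit-level manipulation,
// followed by one Newton-Raphson refinement step.
static float fastInvSqrt(float x) {
    float half = 0.5f * x;
    int i = Float.floatToRawIntBits(x);
    i = 0x5f3759df - (i >> 1);        // magic-constant initial guess
    float y = Float.intBitsToFloat(i);
    y = y * (1.5f - half * y * y);    // one Newton-Raphson step (~0.2% rel. error)
    return y;
}
```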
After some googling I found out these are SSE instructions, and Java can’t use those afaik. Normal math happens using x87 FPU instructions. I copied this from the “Intel Architecture Optimization Reference Manual”:
“Additionally, while the multi-cycle floating-point instructions, fdiv and fsqrt, execute in the floating-point unit pipe, integer instructions can be executed in parallel.”
So it pays off to mix lots of non-FP work into inner loops that use Math.sqrt and division; separating them into different inner loops would throw away that parallelism.
And from “How to optimize for the Pentium family of microprocessors” I read that FSQRT takes 70 cycles and cannot overlap integer multiplication instructions (they block the pipeline). No idea about modern processors, however; they should be much more efficient at this, especially the new Core iX and Sandy Bridge Celeron CPUs.
I’m pretty sure that HotSpot does not use RSQRTSS, but even if it did, it wouldn’t approach that speed, as it would have to perform some Newton steps to restore the accuracy (since in Java you can’t relax precision like this). It would need one step for single precision and two for double (don’t quote me on this).
HotSpot does use SSE instructions for a fair amount of FP computation…it simply uses the SISD versions (*SS, *SD). All basic operations are performed in SSE.