Beware of anti-optimizations

I’ve noticed a fair number of anti-optimizations in some library code. Remember that just because a source file shipped with Java contains an implementation doesn’t mean that implementation will actually be called on any hardware you care about. Examples:

Using SWAR or Hacker’s Delight bit twiddling for anything based on leading zeros, trailing zeros or population count (floorLog2, nextPow2, etc.). ARM and Intel support these operations natively, and HotSpot treats the corresponding JDK methods as intrinsics.
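A rough sketch of the kind of thing I mean, using nextPow2 (the helper names are mine, and this is a sketch rather than a hardened utility):

// Classic SWAR / Hacker's Delight next-power-of-two: pure software bit smearing.
static int nextPow2Swar(int x) {
    x--;
    x |= x >>> 1;
    x |= x >>> 2;
    x |= x >>> 4;
    x |= x >>> 8;
    x |= x >>> 16;
    return x + 1;
}

// Same result via Integer.numberOfLeadingZeros, which HotSpot treats as an
// intrinsic and maps to a native count-leading-zeros style instruction.
// (Valid for 1 <= x <= 2^30.)
static int nextPow2Intrinsic(int x) {
    return 1 << (32 - Integer.numberOfLeadingZeros(x - 1));
}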

Directly calling StrictMath instead of Math. StrictMath is always a software implementation. The Math equivalents are intrinsics and don’t really forward to StrictMath (ignore ARM without hardware floating point here…but regardless, even the software Math version should be much faster).
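As a quick illustration (just a sketch, the input value is arbitrary): both lines below compute a sine, but only the Math entry point is on HotSpot’s intrinsic list, while StrictMath has to stick with the portable software code.

double x = 0.123456789;
double fast   = Math.sin(x);       // intrinsic candidate: may be replaced by a native/compiled routine
double strict = StrictMath.sin(x); // software implementation, bit-identical on every VM
// The two may differ in the last bit or so; Math.sin only promises a result within 1 ulp.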

Follow-up (couldn’t find this info straight away): you can see the (I think complete) set of HotSpot intrinsics in the following files:

http://hg.openjdk.java.net/jdk8/awt/hotspot/file/d61761bf3050/src/share/vm/classfile/vmSymbols.hpp
http://hg.openjdk.java.net/jdk8/awt/hotspot/file/d61761bf3050/src/share/vm/opto/library_call.cpp

Follow-up 2: Since some people actually read this, I’ll augment it with slightly more accurate info.

HotSpot has a set of classes it’s aware of and does some special-case handling for. Here we’re only concerned with methods. One thing it can do is specify a “patch-out” method, which is simply a native method that replaces any Java-implemented version (or the method could also be marked native itself…I don’t know of any specific cases where this is true). In that case the native routine is called (without JNI overhead) instead of compiling and using the Java-based one. Why do this? One big reason is making porting easy, since adding patch-outs can be deferred until needed and until then the software ‘fall-back’ should just work. Another is regression testing when some part of the compiler code base goes haywire during development. The second thing it can do is mark a method as “intrinsic” (such a method will probably always have a patch-out version as well). That means the compiler is actually aware of the method as if it were a built-in function, which allows much more aggressive optimizations to be performed.
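To make the “software fall-back plus intrinsic” idea concrete: the JDK’s own Integer.bitCount looks roughly like the Hacker’s Delight code below (paraphrased from memory, so treat it as a sketch). When the method is intrinsified, this body is never used; the compiler knows the method’s semantics and can emit a population-count instruction directly where one exists.

// Roughly what the Java fall-back in Integer.java looks like (HD, Figure 5-2 style):
public static int bitCount(int i) {
    i = i - ((i >>> 1) & 0x55555555);
    i = (i & 0x33333333) + ((i >>> 2) & 0x33333333);
    i = (i + (i >>> 4)) & 0x0f0f0f0f;
    i = i + (i >>> 8);
    i = i + (i >>> 16);
    return i & 0x3f;
}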

Also, above I incorrectly state that StrictMath is always software based. I forgot that some of the methods require correctly rounded results, and as such, if a native implementation always returns bit-identical results for all inputs then it can be patched out (again, I haven’t checked whether this ever occurs).
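sqrt is a handy example of that correction: it’s specified as correctly rounded in both classes, so a hardware square-root instruction gives bit-identical answers and a native replacement is safe. A quick (unscientific) check along those lines, just as a sketch:

// Verify bit-identical results for sqrt over a random sample.
java.util.Random r = new java.util.Random(42);
for (int i = 0; i < 1_000_000; i++) {
    double x = r.nextDouble() * 1e6;
    long a = Double.doubleToRawLongBits(Math.sqrt(x));
    long b = Double.doubleToRawLongBits(StrictMath.sqrt(x));
    if (a != b) throw new AssertionError("mismatch at x = " + x);
}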

StrictMath is faster than Math == long is faster than int. That should be obvious from the name. >_>

It’s not. See this post.

long vs int? They both have the exact same methods…


Am I missing something? O_o

:wink:

StrictMath has more features than Math. It’s more precise -> slower.
long has more features than int. It has a larger range of values -> slower.

Okay, not the best comparison, but that’s not really the point. Math is supposed to be optimized with native code, while StrictMath ensures that you get the exact same result on every computer that runs it. From just that we can easily draw the conclusion that Math >= StrictMath in performance. They may be equally fast if the native Math version produces the same value as the StrictMath version, in which case both should use the native version. However, if StrictMath is faster, then it’s both more precise (it can’t be the same function as the Math one if the speeds differ) AND faster, and that should be treated as a bug in the VM. Math should never be slower than StrictMath.

Even the long/int comparison holds here. Future/current (???) CPUs might be able to do 64-bit math in a single cycle, but a long should NEVER be faster than an int.

Well, naturally, 64-bit CPUs already perform operations on longs in just as many ticks as on ints (and shorts, and bytes). It’s just that the memory (cache) bandwidth is saturated more quickly with longs, for obvious reasons.

Yes, exactly. A fair number of the SISD ops (in SSEx) on doubles and floats have the same latency/throughput numbers, but data motion will typically make the execution time vary drastically.

Likely not as drastic as a simple doubling of memory-bandwidth time though, depending on where the data actually lives, which would then be all about cache-fill time, wouldn’t it? In which case, I think, the typical case is likely to see next to no difference in speed a lot of the time on 64-bit CPUs.

Cas :slight_smile:

Awww I see the joke now ;D

…well that was embarrassing :persecutioncomplex:

Man, this is nearly impossible to generalize. One thing to remember is that parts of the memory architecture are resources shared between all cores (and, of course, all active threads and processes), and memory traffic is typically going in both directions. Even when data is in the L1 cache (if my memory serves), moving it from L1 to a register isn’t free: the micro-op accesses a port which (if free) delivers 32 bits in 5 cycles and 64 bits in 10. On the store side, the store buffer fills up faster when moving wider data, and will stall things once it’s full.

In my experience there are drastic speed differences between using 32-bit and 64-bit types. What I can say with certainty is that moving less data will virtually never slow you down, and moving more data will slow you down the majority of the time. Of course, don’t read this as me suggesting you always use the smallest possible data size. Do what works for you.
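A crude way to see the bandwidth effect for yourself (just a sketch; the class name and sizes are placeholders, and a proper measurement would use JMH): the add costs the same per element either way, but the long[] pass has to drag twice as many bytes through the cache hierarchy.

public final class SumBench {
    static final int N = 1 << 23; // ~8M elements: ~32 MB of ints vs ~64 MB of longs, far beyond any cache

    static long sumInts(int[] a)   { long s = 0; for (int v : a)  s += v; return s; }
    static long sumLongs(long[] a) { long s = 0; for (long v : a) s += v; return s; }

    public static void main(String[] args) {
        int[] ints = new int[N];
        long[] longs = new long[N];
        for (int i = 0; i < N; i++) { ints[i] = i; longs[i] = i; }
        for (int rep = 0; rep < 5; rep++) { // crude warm-up by repetition
            long t0 = System.nanoTime(); long s1 = sumInts(ints);
            long t1 = System.nanoTime(); long s2 = sumLongs(longs);
            long t2 = System.nanoTime();
            System.out.printf("int[] %d ms, long[] %d ms (sums %d / %d)%n",
                    (t1 - t0) / 1_000_000, (t2 - t1) / 1_000_000, s1, s2);
        }
    }
}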