Microbenchmark - new vs. reuse

I’ve created a little microbenchmark to test the relative costs of a few different ways of getting access to a temporary object inside a short method, and I thought the results might be of interest to other people.

My test computes the cross product of two 3D vectors and returns the result in the first vector. This is a small but real-world-useful operation, and it requires some temporary space for the cross product. I coded up multiple versions that use different techniques to get the needed temporary space, as follows (a minimal sketch of two of the variants appears right after the list):
[1] Local var. This method just uses local double variables for its temporary space. This is the only version that does not use an object (a Vector3d) for its temporary space.
[2] New. Allocate a new Vector3d each time the method is called.
[3] ThreadLocal. Get a temporary object using a ThreadLocal object.
[4] Field. Get a temporary object stored in a private field.
[5] Field sync. Synchronized method which gets its temporary from a private field as in [4].
[6] TempStack. Get a temporary object from a TempStack, which is essentially an object pool where objects must be returned in the reverse order from which they were obtained. The TempStack itself is obtained using a ThreadLocal.
[7] TempStack param. Use a TempStack passed in as an explicit extra parameter. Ugly in that it requires an extra parameter, but can be relatively fast.
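To make the variants concrete, here is a minimal sketch of what versions [2] and [4] might look like, assuming a bare-bones Vector3d with public x, y, z fields; the actual benchmark code (linked below) may differ in detail.

// Minimal sketch of variants [2] and [4]; assumes a bare-bones Vector3d.
class Vector3d {
    public double x, y, z;
}

class CrossProduct {
    // [2] New: allocate a fresh temporary on every call.
    static void crossNew(Vector3d a, Vector3d b) {
        Vector3d t = new Vector3d();
        t.x = a.y * b.z - a.z * b.y;
        t.y = a.z * b.x - a.x * b.z;
        t.z = a.x * b.y - a.y * b.x;
        a.x = t.x; a.y = t.y; a.z = t.z;  // result returned in the first vector
    }

    // [4] Field: reuse a private field as scratch space.
    // Not thread-safe, and not usable from recursive methods.
    private final Vector3d temp = new Vector3d();
    void crossField(Vector3d a, Vector3d b) {
        Vector3d t = temp;
        t.x = a.y * b.z - a.z * b.y;
        t.y = a.z * b.x - a.x * b.z;
        t.z = a.x * b.y - a.y * b.x;
        a.x = t.x; a.y = t.y; a.z = t.z;
    }
}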
Method 4 is not thread-safe, and methods 3, 4, and 5 cannot be used in recursive methods. Method 2 is the cleanest of the object-based methods, but how does its performance compare to the others? Here are some timings from my 1.7GHz Pentium 4 machine:

Test               JVM 1.4.2 -client   JVM 1.4.2 -server
Local var          0.076               0.015
New                0.144               0.120
ThreadLocal        0.100               0.039
Field              0.048               0.040
Field sync         0.205               0.216
TempStack          0.127               0.047
TempStack param    0.069               0.016

Times are in microseconds per method call. You can get the complete source code here: http://www.graphics.cornell.edu/~bjw/CrossProductTest.java
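The pooling variants are the least obvious, so here is a hypothetical sketch (pre-1.5 Java, no generics) of what a TempStack-style pool, its ThreadLocal access [3]/[6], and the explicit-parameter variant [7] could look like; the real classes in the linked source may differ.

import java.util.ArrayList;

// Hypothetical sketch of a TempStack-style pool as described above; the
// class in the actual benchmark source may differ. Objects must be
// released in the reverse order they were obtained.
class TempStack {
    private final ArrayList pool = new ArrayList();
    private int top = 0;

    Vector3d get() {
        if (top == pool.size()) pool.add(new Vector3d());
        return (Vector3d) pool.get(top++);
    }

    void release(Vector3d v) {
        top--;  // callers must release in reverse order of get()
    }
}

// [3]/[6] Per-thread state via ThreadLocal (pre-1.5 style, no generics).
class Temps {
    private static final ThreadLocal STACKS = new ThreadLocal() {
        protected Object initialValue() { return new TempStack(); }
    };
    static TempStack current() { return (TempStack) STACKS.get(); }
}

// [7] TempStack param: the caller supplies the pool explicitly, so the
// hot method avoids even the ThreadLocal lookup:
//
//     static void cross(Vector3d a, Vector3d b, TempStack ts) {
//         Vector3d t = ts.get();
//         // ... compute the cross product into t, copy into a ...
//         ts.release(t);
//     }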

A few things to note from the results:
[*] As others have noted, under 1.4.2, -server is much faster for floating point code than -client.
[*] The difference between (1) and (2) gives the approximate cost of allocating and garbage collecting the temporary Vector3d object. Allocation increases the cost of the cross product method by a factor of roughly 2 under -client (0.144 vs. 0.076) and 8 under -server (0.120 vs. 0.015), so it is still a very significant cost in this case.
[*] The synchronized method is the most expensive in all cases, so it is still best to avoid synchronization when possible.
[*] We use a technique similar to (7) in performance-critical sections of our own code, and I would happily change the code to something cleaner like (2) if the cost were small. However, the cleaner object-based techniques are still significantly slower.
[*] Although I expected (1) to be the fastest, under -client it actually turns out to be slower than (4) and (7), for reasons I don’t understand.
[*] The field method (4) seems to be relatively slow under -server, for reasons I also don’t understand.

Caveat: This is a microbenchmark and performance may be different in real applications. I think garbage collection is a wonderful thing, and I do not advocate abandoning it for object pools except when really necessary for performance reasons (preferably after profiling your code first). Comments and critiques are welcome.

Mac OS X 10.2.6  Java 1.4.1-client  (1GHz)
Cross product local variable speed testing
1)local double var.... avg 0.082891665 usecs
2)new................. avg 0.14235833 usecs
3)ThreadLocal..........avg 0.14793333 usecs
4)Instance field.......avg 0.06655 usecs
5)Instance field sync..avg 0.12054167 usecs
6)TempStack............avg 0.16006666 usecs
7)TempStack param......avg 0.09599167 usecs

Note that #5 is cheaper than #2, #3, and #6, which is significantly different from your results with the 1.4.2 Intel VM.

Mac OS X 10.2.6  Java 1.4.1-server  (1GHz)
Cross product local variable speed testing
1)local double var.....avg 0.08573333 usecs
2)new..................avg 0.1584 usecs
3)ThreadLocal..........avg 0.153775 usecs
4)Instance field.......avg 0.06821667 usecs
5)Instance field sync..avg 0.124858335 usecs
6)TempStack............avg 0.168125 usecs 
7)TempStack param......avg 0.10123333 usecs

Red Hat 9, JRE 1.4.2


local double var     avg 0.053925 usecs   total 6.471 secs
new                  avg 0.20695834 usecs   total 24.835 secs
ThreadLocal          avg 0.11985833 usecs   total 14.383 secs
Instance field       avg 0.04611667 usecs   total 5.534 secs
Instance field sync  avg 0.046491668 usecs   total 5.579 secs
TempStack            avg 0.137975 usecs   total 16.557 secs
TempStack param      avg 0.0551 usecs   total 6.612 secs

-server


local double var     avg 0.036875002 usecs   total 4.425 secs
new                  avg 0.17416666 usecs   total 20.9 secs
ThreadLocal          avg 0.082575 usecs   total 9.909 secs
Instance field       avg 0.043675 usecs   total 5.241 secs
Instance field sync  avg 0.056941666 usecs   total 6.833 secs
TempStack            avg 0.09243333 usecs   total 11.092 secs
TempStack param      avg 0.0455 usecs   total 5.46 secs

EDIT: added a server run and redid the client run (the previous one was taken while doing a lot of other stuff… IDEA, xmms, xine, etc.)

Thanks for the additional data points. A few observations:
[*] All three client JVMs seem to preserve the oddity that using a field (4) is cheaper than using local variables (1). I still don’t understand why this is the case.
[*] The relative cost of a synchronized method (5) seems to be lower on MacOSX, and much lower under Redhat, as compared to my Windows results.
[*] The relative cost of new and garbage collection seems slightly lower under MacOSX but significantly higher under Redhat. It is still large enough in all cases to be a potential bottleneck in truly performance-critical code.
[*] Under MacOSX, -client and -server are not significantly different, which is not surprising since my understanding is that -server is ignored under the current MacOSX JVM. (The snippet below shows one way to check which VM actually started.)
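For anyone who wants to confirm this locally, the standard java.vm.name system property reports which HotSpot variant is actually running (the exact value strings vary by vendor and version):

// Print which VM variant actually started (standard system properties;
// exact value strings vary by vendor and version).
public class WhichVM {
    public static void main(String[] args) {
        System.out.println(System.getProperty("java.vm.name"));    // e.g. "Java HotSpot(TM) Client VM"
        System.out.println(System.getProperty("java.vm.version"));
    }
}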

If you’re interested, I could do a test on the 2.6.0-test3 kernel too (the previous results come from 2.4).

Yes, there is only a single shared library for the HotSpot VM on Mac OS X. I didn’t realise it at first, but the server and client shared library files are both there as aliases to a single ‘hotspot’ library.

I guess the slight differences are caused by different parameter values passed to the same VM (e.g. compile thresholds, size of the young generation in the heap, etc.).

I too have no clue why a field is faster than a stack variable - weird. If only there were a way to disassemble the native code produced by HotSpot.

I ran some more tests to see why the cost of synchronization seemed to vary so much, and it appears to depend on whether you are running on a single-processor or dual-processor machine. My previous tests were run on a dual-processor machine, which I didn’t mention because it didn’t seem relevant: the test is entirely single-threaded (i.e. the other processor just sits idle). However, I’ve gone back and redone my timings (with fewer other applications open) on single- and dual-processor machines which are otherwise similar, and here are the results:

Test (JVM 1.4.2, 1.7GHz P4)   client single   client dual | server single   server dual
Local var                     0.077           0.076       | 0.012           0.012
New                           0.070           0.141       | 0.056           0.124
ThreadLocal                   0.102           0.100       | 0.043           0.039
Field                         0.043           0.042       | 0.045           0.043
Field sync                    0.057           0.231       | 0.055           0.178
TempStack                     0.121           0.128       | 0.045           0.047
TempStack param               0.053           0.072       | 0.016           0.016

[*] Most results are similar, except that the costs of new (2) and synchronization (5) are much higher on a dual-processor machine.
[*] Adding synchronized to a method is virtually free on a single processor (assuming no contention), but fairly expensive on a dual processor.
[*] Using new (2) on a single-processor machine under the client JVM seems to be reasonably fast. Only the field (4) and field sync (5) methods are faster, and not by that much. However, on a dual processor, or when using -server, there are other techniques that are much faster than using new.
[*] TempStack param (7) also seems to slow down somewhat on a dual processor, for reasons I don’t understand, but only under -client, not -server.
[*] Would we see the same slowdowns on a single-processor machine with HyperThreading enabled? (It was not enabled in any of my tests.) I’ll try to test this if I can find a suitable machine.

I wonder if the JVM actually generates different code on a single vs. dual processor machine, or if there is something else going on here (cache effects? context switching?).
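The VM does at least have the information needed to specialize: Runtime.availableProcessors() (added in 1.4) reports the processor count, so processor-count-dependent behaviour is possible in principle. A trivial check:

// Print the processor count the VM sees (java.lang.Runtime, since 1.4).
public class CpuCount {
    public static void main(String[] args) {
        System.out.println("Processors visible to the VM: "
                + Runtime.getRuntime().availableProcessors());
    }
}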

The PPC architecture uses register windowing and suchlike from its RISC beginnings, doesn’t it? It’s pretty poorly adapted for stack-based architectures like the JVM, whereas the x86, with its paucity of general-purpose registers, is much better at stack ops. So let’s guess that the fields get mirrored into registers on PPC.

Cas :slight_smile:

Yes, a dual processor makes a big difference for synchronisation. For a single CPU, raising the IRQ level is enough to prevent a context switch and therefore gain exclusive access for a moment. With dual CPUs, fancier mechanisms must be used. On Windows, the kernel will use spinlocks in dual-processor mode, operations that are no-ops with the single-processor kernel.

I don’t know if Mac OS X has the same distinction for synchronisation operations. I get the feeling that in the world of Macs, dual-processor machines are much more popular than in the world of Windows.
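A quick way to check the uncontended cost on a particular machine/kernel is to time a plain call against a synchronized one from a single thread. A minimal sketch (the iteration count and millisecond timing are arbitrary choices, not from the original benchmark; as with any microbenchmark, warm-up and inlining can distort the numbers):

// Minimal single-threaded sketch of uncontended synchronization cost.
public class SyncCost {
    private int n;
    void plain()               { n++; }
    synchronized void locked() { n++; }

    public static void main(String[] args) {
        SyncCost s = new SyncCost();
        int iters = 10000000;
        long t0 = System.currentTimeMillis();
        for (int i = 0; i < iters; i++) s.plain();
        long t1 = System.currentTimeMillis();
        for (int i = 0; i < iters; i++) s.locked();
        long t2 = System.currentTimeMillis();
        System.out.println("plain:  " + (t1 - t0) + " ms");
        System.out.println("locked: " + (t2 - t1) + " ms");
    }
}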

Cas, I’m not sure why the PPC would be any worse at stack ops; the compiler can choose any register to be the stack pointer, and it will work pretty much the same as an Intel stack. I believe the available addressing modes will mimic push/pop without the need for any extra instructions or longer execution times. The proper set of general-purpose registers is, in most cases, a win. The main problem, until recently with IBM’s latest PPC chips, has been the lagging clock speeds of the PPC CPUs. But this discussion is for another thread, if it is worth pursuing at all :slight_smile:

So who has a theory on the field versus local observations?

  • As others have noted, under 1.4.2 -server is much faster for floating point code than -client

For something as fundamental as floating point performance, I would just like to know the reason behind this discrepancy between the client and server options.

As a rough benchmark, I ran some cases with my Java3D particle tracking algorithm, which involves fully double-precision calculations, newing of dynamic primitive arrays and objects of that type only, no synchronization anywhere, extensive polymorphic method calls (since the cells are different kinds of polyhedra), no accounting for GC times, and no newing of objects within loops.

The times taken for creating 1000 traces, over repeated invocations without any pauses, are (in secs):

client: 30.98, 31.20, 32.84, 20.98, 21.48, 20.82
server: 25.02, 10.32, 11.15, 10.65, 10.25, 10.27

Looks like after the initial warm-up period, the server is roughly twice as fast.

What would it take to get Sun to spruce up the client VM so it would number-crunch as fast as the server?

Edit: Forgot to ask if anyone has compared Java vs. C++ FP performance with the JVM in server mode? That may be very interesting… I have done some comparisons here, but not exactly apples-to-apples.

I suspect the reason is that the server VM is allowed to spend more time compiling bytecode to machine code, and the algorithm required to generate more optimized floating point instructions (e.g. the SSE instructions that are used on Intel by the server VM) requires too much processing time in the compiler. For a client VM, the lag caused by a runtime compilation pause would be considered unacceptable. That’s my guess anyway.

BTW… I did a test 10 months ago with MS Visual C++ 6.0, doing a simple conversion from RGB colour space to YUV. The Java version ran faster than the C++ .exe. The reason, I suspect, was the very poor performance of the Microsoft compiler on floating-point-to-integer conversions. Apparently it sets the floating point rounding mode twice every time a conversion to int is required (once to set the mode to C-style rounding, once to set it back to natural round-to-nearest). Intel’s C++ compiler at the time was considered vastly superior. The GNU C++ compiler on Intel, at least prior to version 3, also produced very poor code… so bad that I know of one project that abandoned the idea of a Linux port because they didn’t want to release something with such poor performance and possibly ruin the reputation of the company.
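For context on where those conversions come up: the inner loop of such a colour-space conversion contains several float-to-int casts per pixel. A sketch of the general shape (illustrative BT.601-style coefficients, not the original test code):

// Sketch of an RGB -> YUV inner loop; every (int) cast below is a
// floating-point-to-integer conversion of the kind discussed above.
// (Illustrative BT.601-style coefficients, not the original test code.)
public class RgbToYuv {
    static void convert(int[] rgb, int[] yuv) {
        for (int i = 0; i < rgb.length; i++) {
            int p = rgb[i];
            double r = (p >> 16) & 0xff;
            double g = (p >> 8) & 0xff;
            double b = p & 0xff;
            int y = (int) ( 0.299 * r + 0.587 * g + 0.114 * b);
            int u = (int) (-0.169 * r - 0.331 * g + 0.500 * b + 128.0);
            int v = (int) ( 0.500 * r - 0.419 * g - 0.081 * b + 128.0);
            yuv[i] = (y << 16) | (u << 8) | v;
        }
    }
}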

— Field (4) faster than local variables (1) under -client
I’ve profiled the code using VTune, which has the side benefit of allowing one to view the assembly code produced by the HotSpot compiler. I’ve posted the resulting assembly code for the local variable and field routines here (sorry for the strange formatting):
http://www.graphics.cornell.edu/~bjw/CPTLocalVarClient.txt
http://www.graphics.cornell.edu/~bjw/CPTFieldClient.txt
I’m not an x86 assembly expert, but perhaps there is an expert out there who can analyze the differences. One thing I noticed is that the local variable code computes the results using FP registers and then copies the results using integer registers, while the field code uses FP registers throughout.

— Field (4) much slower than local variables (1) under -server
This turns out to be an inlining effect. HotSpot inlines method (1) by default but not method (4). If I disable all inlining (using the -XX:MaxInlineSize=1 -XX:FreqInlineSize=1 flags), then the local var method slows down to 0.038 us (or just a hair faster than field). However, I could not find any parameter setting that would convince HotSpot to inline the field method (4) the way it inlines method (1) by default.

Incidentally, I don’t think it’s actually any more difficult to generate SSE/2 instructions instead of x87 for the floating point code (in fact the SSE/2 code is simpler and probably easier to generate). HotSpot -server does not use the SIMD parts of SSE/2, just the scalar instructions, as shown in the assembly code linked below. I think SSE/2 FP code is a feature that is likely to migrate down into the client JVM in the next version, especially if enough people request it.
http://www.graphics.cornell.edu/~bjw/CPTLocalVarServer.txt

Much appreciate the info, swpalmer and walter_bruce.

Speedups of 2-8 are quite incredible - imagine all the hoopla people go through with itty-bitty micro-optimizations.

Anyone interested in pursuing this further and taking it to Sun for the client option? I’d love that.

Vaidya

Edit: OK! The upper bound should actually read 5, not 8, based on the microbenchmark.