[quote]macroscopically, you want to "public static final " all your methods.
[/quote]
A waste of time. These days the JIT compiler will do this without the need for hints. Objects aren’t slow anymore either. In fact, some techniques you might use to avoid them are more expensive than the code they were intended to improve. You will often be able to access a field in an object faster than an arbitrary element of an array (i.e. an access the JIT can’t prove is within the array bounds, and for which it must therefore emit a bounds check).
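For instance, the difference I mean is between these two kinds of access (a trivial sketch, nothing more):
[code]
// Trivial illustration of the bounds-check point (not a benchmark).
final class Particle {
    double x, y, z;
}

final class Access {
    // Field access: the JIT knows exactly where x lives, so no check is needed.
    static double fieldAccess(Particle p) {
        return p.x;
    }

    // Arbitrary indexed access: unless the JIT can prove i is in range,
    // it has to emit a bounds check before the load.
    static double arrayAccess(double[] coords, int i) {
        return coords[i];
    }
}
[/code]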
[quote]macroscopically, you want to "public static final " all your methods.
[/quote]
Current JITs are able to optimize even instance virtual calls perfectly in most cases. Anyway, even if you don’t trust the JIT to do it for you, you only need the static modifier - the final one is superfluous (all static methods are effectively ‘final’).
Yup. This "make everything static and final" business has been nothing but harmful to code for a few generations of JITs now.
Of course, if you’re stuck on an ancient VM then I won’t hazard a guess at what the results would be.
[quote]writing fast java code reminds me of writing fast c code - avoid objects!
[/quote]
Um, no. Fast C code is generally slow Java code today (assuming modern JITs).
C is fastest doing everything through array references. In Java, non-linear array references are expensive. C is generally slow at recursion; Java recursion can be blindingly fast (faster than a local array-based stack, in fact).
Memory allocation in most C runtimes is dirt slow. Memory allocation in Java is highly, highly optimized.
As already mentioned, calling through a class or an interface is no different than calling through a static method. This has been the case since at least JDK 1.2.
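To make the recursion point above concrete, this is the shape of the comparison I mean (a toy sketch, not the actual test from the book):
[code]
// Toy illustration: plain recursion vs. an explicit array-backed stack over
// the same binary tree. Not a benchmark - just the shape of the comparison.
final class TreeNode {
    int value;
    TreeNode left, right;
}

final class TreeSum {
    // Plain recursion - modern JITs compile this very efficiently.
    static long sumRecursive(TreeNode n) {
        if (n == null) return 0;
        return n.value + sumRecursive(n.left) + sumRecursive(n.right);
    }

    // Hand-rolled stack - the kind of "C-style" rewrite that often fails to
    // pay off in Java.
    static long sumWithExplicitStack(TreeNode root) {
        TreeNode[] stack = new TreeNode[64];
        int top = 0;
        long sum = 0;
        if (root != null) stack[top++] = root;
        while (top > 0) {
            TreeNode n = stack[--top];
            sum += n.value;
            if (top + 2 > stack.length) {                  // grow if needed
                TreeNode[] bigger = new TreeNode[stack.length * 2];
                System.arraycopy(stack, 0, bigger, 0, top);
                stack = bigger;
            }
            if (n.left != null) stack[top++] = n.left;
            if (n.right != null) stack[top++] = n.right;
        }
        return sum;
    }
}
[/code]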
An old but still good introduction to some of the differences between how C and Java code execute can be found here. Java VMs have actually improved at running “C-like” code in some ways since this article was written, but the base lessons are still pretty accurate:
http://www.aceshardware.com/Spades/read.php?article_id=153
There are also some things on this in Steve’s and my book - in particular, the recursion test.
[quote]I suspect that the fibonacci heap (or at least your implementation of it) is not well suited to Java. Try implementing a basic binary heap instead. The array used in the consolidate method is probably very cheap in C++ (stack allocated), but costs quite a bit more in Java.
[/quote]
If you can spot “problems” in my implementation, speak up now. That’s why I went through the trouble of gaining permission to post my code in the first place.
Telling me to implement a d-heap is just skirting the issue and is utterly pointless as that data structure doesn’t give me the algorithmic performance characteristics I need.
God bless,
-Toby Reyelts
[quote]In my case, the tests show that roughly 75% is due to the node creation, not the insertion per se.
If pre-creating the nodes is an option, then you will see a big increase in speed.
[/quote]
I would say that something is wrong with your timing then, because the profiler shows that the bulk of the time is spent in extractMin - not in allocating or inserting nodes.
God bless,
-Toby Reyelts
[quote]The array used in the consolidate method is probably very cheap in C++ (stack allocated), but costs quite a bit more in Java.
[/quote]
You can’t allocate the array on the stack unless you use something platform-dependent like alloca. I pull the heap allocation out of the function call anyway. I have to memset the array in every single call to consolidate, so it’s not exactly free.
I could do the same thing in Java, but it’s actually more expensive to clear the array than it is to heap-allocate it from scratch. Of course, it’s kind of hard to measure the resulting effect on garbage collection.
God bless,
-Toby Reyelts
Just thought I’d let people know that I’ve finished doing the port and the results are dramatic. On a small size problem set, the C++ code is about five times faster. The gap between performance in the C++ and Java code increases rapidly (perhaps geometrically?) as the size of the data set increases. (We work with data sets ranging from hundreds of thousands of objects to about 100 million). Unfortunately, I can’t run the Java program with the largest data sets, because the virtual machine can’t allocate over 1.7G of heap space.
The algorithm used in both programs is identical. Of course, both programs try to take advantage of any naturally well performing features their respective languages carry.
God bless,
-Toby Reyelts
Are you trying to say that the current crop of VMs performs tail-recursion optimization? I’ve heard people (those who do things like implement Scheme on top of the JVM) complaining of the exact opposite - that it doesn’t happen.
God bless,
-Toby Reyelts
Aha - now if the problem is scaling geometrically in Java and not in C++, it points to something else that’s scaling badly rather than the ported code itself. Could it be GC?
Ask yourself this though before losing more hair - would you rather have an implementation with guaranteed safe GC and null pointer detection that runs slower, or a super-fast but entirely unsafe version of the algorithm?
As this is a server-side task you’re doing, IIRC, you’re probably better off trading performance for reliability and investing your money in faster processors for the server. Programmer time is very expensive indeed compared to CPU upgrades!
This is, of course, no real excuse for the algorithm being so slow in Java. What parameters are you running your benchmark with?
Cas
[quote]Aha - now if the problem is scaling geometrically in Java and not in C++, it points to something else that’s scaling badly rather than the ported code itself. Could it be GC?
[/quote]
I’m not exactly sure how the performance changes with the size of the data set (not enough data set sizes to test on yet), but I do know that it sure as heck isn’t linear. It’s possible that it could be GC. I really should just take the time to run the profiler again.
[quote]Ask yourself this though before losing more hair - would you rather have an implementation with guaranteed safe GC and null pointer detection that runs slower, or a super-fast but entirely unsafe version of the algorithm?
[/quote]
That’s like asking me if I wouldn’t prefer bubble-sort over a mergequick-sort because it’s easier to write a bubble-sort. Maybe if I were only dealing with 100 objects. I think we would have written it in assembly if we could have gotten the same performance improvements. (I don’t believe rewriting the app in assembly would really make a discernible difference, but the point holds.)
In general, yes, we prefer Java over C++. C++ just doesn’t scale well over more than a few senior engineers who are well-versed in the language.
[quote]As this is a server-side task you’re doing, IIRC, you’re probably better off trading performance for reliability and investing your money in faster processors for the server. Programmer time is very expensive indeed compared to CPU upgrades!
[/quote]
First, it would cost every single one of our customers more money for faster servers, which more than makes up for our one-time software engineering cost. Second, you won’t find processors much faster than 2.4 GHz. Third, the difference in memory speed is even smaller. Fourth, none of it makes any difference anyway, because small linear hardware speedups don’t make up for geometric software slowdowns. The Java version doesn’t perform fast enough on even the smallest data sets.
[quote]This is, of course, no real excuse for the algorithm being so slow in Java. What parameters are you running your benchmark with?
[/quote]
It’s not really a benchmark anymore - it’s a small application (that depends heavily upon the fibonacci heap). Depending upon the data set size, we run it with max heap set anywhere from 512M to 1.7G with -server set. We tried using different min heap sizes, but that doesn’t really affect anything. We tried using the incremental garbage collector for kicks, but that gave us performance that was, urmmm… garbage.
God bless,
-Toby Reyelts
Runtime O() values aren’t always helpful if the constant factors are sufficiently different. What you need is an algorithm that performs well in practice. Given the benchmark example of princec, with the string creation taken out of the loop I get 4.5s on my machine for your code. A completely straight leftist tree written in 30 minutes from a textbook takes 2.4s for the same test (and gives the same results). The leftist tree code, including the benchmark, comes to 121 lines compared with 503 for the fibonacci heap. I have done no performance tuning on it at all.
The fib heap should give an amortised O(1) time for insert vs O(log n) for the leftist tree. Both have O(log n) for deleteMin. The fib heap also has O(1) for decreaseKey.
Next thing to note is that the benchmark is not representative of any practical use of such trees. In pathfinding, the maximum size of the tree is often about sqrt(n), where n is the total number of nodes inserted during the algorithm. The decreaseKey operation in my experience doesn’t happen often enough to spend much time optimising. My usual problem is computing minimum-cost paths in road networks containing up to 2.5 million nodes and 3.3 million links. My latest Java code for this is an order of magnitude faster than the C++ code which preceded it. Of course I changed the algorithm, but this was assisted by the ease of writing in Java, and the algorithm actually makes use of the behaviour of the garbage collector. It would be difficult to achieve the performance without a garbage collector (or an unlimited memory budget)!
So there you have it; C++ is REALLY REALLY SLOW!!!
Incidentally, the size of the node object is quite a bit less for the leftist tree than for the fibonacci heap. Running with the -server option, the times are 1.6s for the leftist tree vs about 3s for the fib heap. However, the timing for the fib heap was much more variable than that for the leftist tree, possibly because of the extra data being allocated (more GC operations).
I also put a loop around princec’s benchmark and ran it quite a few times to make sure the JIT was properly warmed up.
The following is an adequate description of a leftist tree.
http://www.dgp.toronto.edu/people/JamesStewart/378notes/10leftist/
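For reference, the whole thing is built on a single merge operation; a minimal sketch (written from that description, not my benchmarked code) looks like this:
[code]
// Minimal leftist-tree sketch (illustrative only).
final class LeftistNode {
    double key;
    int rank;                    // length of the shortest path to a null child
    LeftistNode left, right;

    LeftistNode(double key) { this.key = key; this.rank = 1; }
}

final class LeftistTree {
    private LeftistNode root;

    // Everything (insert, deleteMin) is built on merge, which is O(log n).
    private static LeftistNode merge(LeftistNode a, LeftistNode b) {
        if (a == null) return b;
        if (b == null) return a;
        if (b.key < a.key) { LeftistNode t = a; a = b; b = t; }  // smaller key on top
        a.right = merge(a.right, b);
        // Restore the leftist property: left child's rank must be >= right child's.
        int leftRank = (a.left == null) ? 0 : a.left.rank;
        int rightRank = (a.right == null) ? 0 : a.right.rank;
        if (leftRank < rightRank) { LeftistNode t = a.left; a.left = a.right; a.right = t; }
        a.rank = Math.min(leftRank, rightRank) + 1;
        return a;
    }

    void insert(double key) { root = merge(root, new LeftistNode(key)); }

    double deleteMin() {         // assumes a non-empty tree
        double min = root.key;
        root = merge(root.left, root.right);
        return min;
    }
}
[/code]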
No, really? rolls eyes I guess that’s why we use mergequickinsertsort instead of mergesort. sigh
[quote]The fib heap should give an amortised O(1) time for insert vs O(log n) for the leftist tree. Both have O(log n) for deleteMin. The fib heap also has O(1) for decreaseKey.
[/quote]
Yes indeed - those are the performance characteristics for a fib heap.
[quote]Next thing to note is that the benchmark is not representative of any practical use of such trees.
[/quote]
Yes, and the “application” which I mentioned in my last post is quite representative of a practical use of such trees. It is the application which we are benchmarking now, and not just the fibonacci heap.
[quote]The decreaseKey operation in my experience doesn’t happen often enough to spend much time optimising.
[/quote]
I agree that decreaseKeys are less important for our particular application, but that doesn’t negate the fact that they’re faster in a fib heap, or that binary heaps have an O(log n) cost to pay on inserts.
In general, whether decreaseKey is important for you or not depends upon exactly what algorithm you are implementing. Standard Dijkstra’s (regardless of the type of queue involved) requires an inordinate number of decreaseKeys, and that was precisely one of the reasons the fibonacci heap was invented.
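For anyone following along, this is where decreaseKey gets hammered in textbook Dijkstra (a rough sketch with made-up placeholder types, not our code):
[code]
// MinHeap, Node and Edge are hypothetical placeholders for illustration.
interface MinHeap {
    boolean isEmpty();
    Node extractMin();
    void decreaseKey(Node n, double newKey);
}

final class Node { double dist = Double.POSITIVE_INFINITY; Edge[] edges; }
final class Edge { Node target; double weight; }

final class Dijkstra {
    // Core relaxation loop; the source is assumed to start with dist = 0 and
    // to already be in the heap, edges are assumed non-null, and insertion of
    // newly reached nodes is omitted.
    static void relaxAll(MinHeap heap) {
        while (!heap.isEmpty()) {
            Node u = heap.extractMin();
            for (int i = 0; i < u.edges.length; i++) {
                Edge e = u.edges[i];
                double candidate = u.dist + e.weight;
                if (candidate < e.target.dist) {
                    e.target.dist = candidate;
                    // Fibonacci heap: amortised O(1). Binary heap: O(log n).
                    // On dense graphs this can fire roughly once per edge,
                    // which is why the cost of decreaseKey matters.
                    heap.decreaseKey(e.target, candidate);
                }
            }
        }
    }
}
[/code]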
[quote]My latest Java code for this is an order of magnitude faster than the C++ code which preceded it. Of course I changed the algorithm, but this was assisted by the ease of writing in Java, and the algorithm actually makes use of the behaviour of the garbage collector. It would be difficult to achieve the performance without a garbage collector (or an unlimited memory budget)!
[/quote]
I’d love to know what you mean by “taking advantage” of the garbage collector. I can guarantee you that I can write a C++ application that spends fewer CPU cycles in memory management than your Java application. In fact, these kinds of applications just scream aloud for manual memory management.
[quote]So there you have it; C++ is REALLY REALLY SLOW!!!
[/quote]
Now you’re just plain being silly, Mark. What you’re trying to tell me is to use a slower data structure, because current Java VMs aren’t capable of handling a faster data structure. Yes, that certainly encourages me to stick with Java.
Look folks, I’m not trolling here. I just want fast-running Java code, and it’s not happening. Mark managed to eke out a constructive comment, which was that he finds the current VMs aren’t good with fibonacci heaps (for whatever reason that may be). So, I’ll try out a 2-heap with the Java version.
Thanks Mark.
God bless,
-Toby Reyelts
It depends on the data — on my data the number of decreaseKeys is small relative to the number of inserts.
Many of the paths are very short, thus only a small part of the total graph need be scanned. However you still have to initialise all the nodes (or at least appear to). I use special labels which can be distinguished as belonging to a previous computation. I end up with an unknown number of references to these and just leave it to the garbage collector to dispose of them when they are no longer relevant. This is something that a good garbage collector can do very efficiently. The simplest solution in C++ would be to just tolerate a slow leak or do reference counting which is nasty and not all that fast.
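Roughly, the trick looks like this (a simplified sketch with made-up names; my real code does rather more):
[code]
// Sketch of the stale-label idea: each search run gets its own marker object,
// and a node's label only counts if it points at the current run's marker.
// Labels from old runs are never explicitly cleared - they just become
// garbage for the collector - so there is no reset pass over the whole graph.
final class RunMarker {}

final class Label {
    final RunMarker run;     // which search produced this label
    double cost;             // tentative cost for that search
    Label(RunMarker run, double cost) { this.run = run; this.cost = cost; }
}

final class GraphNode {
    Label label;             // may be null, or left over from an earlier run
}

final class Search {
    private RunMarker currentRun = new RunMarker();

    void beginRun() { currentRun = new RunMarker(); }

    double costOf(GraphNode n) {
        // A label from a previous run counts as uninitialised, so the graph
        // merely "appears" to have been initialised.
        if (n.label == null || n.label.run != currentRun) {
            return Double.POSITIVE_INFINITY;
        }
        return n.label.cost;
    }

    void setCost(GraphNode n, double cost) {
        n.label = new Label(currentRun, cost);   // the old label becomes garbage
    }
}
[/code]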
Well there may be a better way to implement a fib heap in Java, but the complexity of the algorithm makes it more difficult to determine where the performance problem really lies. You might want to try using a red/black tree for comparison as well. Sometimes it is easier to determine the best implementation if you have a selection of competing alternatives — they may offer a hint as to what is going wrong with the one that ought to be best according to theory.
Best of luck,
Mark
As a C++ programmer who has recently moved to Java, I see two aspects of your code that I would happily do in C++ but would avoid like the plague in Java (perhaps wrongly, but…), and they are:
assert() - lovely feature, but no pre-processor means it’s still gonna give you a hit at runtime (see the sketch below).
Inlining - or the complete lack of it. I would suspect the asserts, and all your lovely access functions, would not be removed, and if I recall correctly they incur what would be a virtual-function overhead (a jump through a pointer from a table) that is gonna screw up the branch prediction on the CPU.
Now in optimised C++, aggressive inlining would make all these go away, and I would expect a 4x speed-up over the non-inlined version.
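For the assert point, this is the kind of thing I mean (a hypothetical example):
[code]
// Hypothetical example. Java asserts compile to a check guarded by a hidden
// per-class flag; they are disabled by default and only execute when the VM
// is started with -ea / -enableassertions.
final class HeapNode {
    private int degree;

    void setDegree(int degree) {
        assert degree >= 0 : "degree must be non-negative, was " + degree;
        this.degree = degree;
    }
}
[/code]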
That array allocation also scares me a bit. Are you sure clearing a static array is really slower than allocating a new one?!? How about arraycopy’ing a zeroed array over the static one - rumour has it that arraycopy is nice and fast.
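Something like this is what I have in mind (a sketch; the names and the size are placeholders):
[code]
import java.util.Arrays;

// Three ways of getting a "clean" working array per consolidate call.
final class DegreeScratch {
    private static final int MAX_DEGREE = 64;                       // illustrative bound
    private static final Object[] ZEROED = new Object[MAX_DEGREE];  // stays all-null

    private final Object[] scratch = new Object[MAX_DEGREE];

    Object[] freshByAllocation() {
        return new Object[MAX_DEGREE];            // new array, implicitly nulled
    }

    Object[] freshByFill() {
        Arrays.fill(scratch, null);               // clear the reused array by hand
        return scratch;
    }

    Object[] freshByArraycopy() {
        System.arraycopy(ZEROED, 0, scratch, 0, MAX_DEGREE);  // blast nulls over it
        return scratch;
    }
}
[/code]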
Java dudes - feel free to shoot my paranoid fantasies down in flames
- Dom
The server VM completely optimises asserts away, and performs excellent inlining of code - better than C++, from what I can tell. Even the virtual function calls get eliminated, ISTR.
The client VM is nowhere near as hot
The array allocation is extremely cheap in Java, nearly the same speed as a stack allocation in C++. The GC is similarly very fast, as the allocated arrays never make it out of Eden.
Have you tried -XX:MaxInlineSize=128 (or better) with -server?
(And use -XX:CompileThreshold=1000 to shorten the time it takes to get benchmarking results)
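i.e. something along these lines (the class name and heap size are just placeholders):
[code]
java -server -Xmx512m -XX:MaxInlineSize=128 -XX:CompileThreshold=1000 FibHeapBenchmark
[/code]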
Cas
Yes, that is true too. Dense networks make for lots of decreaseKeys.
[quote]Many of the paths are very short, thus only a small part of the total graph need be scanned. However you still have to initialise all the nodes (or at least appear to). I use special labels which can be distinguished as belonging to a previous computation. I end up with an unknown number of references to these and just leave it to the garbage collector to dispose of them when they are no longer relevant. This is something that a good garbage collector can do very efficiently. The simplest solution in C++ would be to just tolerate a slow leak or do reference counting which is nasty and not all that fast.
[/quote]
I don’t scan the entire graph either. The only dynamic allocation I perform per route is a small constant amount (about 100 allocations of perhaps a few KB total). I allocate all other references from pools and bulk-release those references when I’m done. Allocation and deallocation are almost exactly free.
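The pattern, sketched in Java purely for illustration (the real pools are in the C++ port, and these names are made up):
[code]
// Illustrative pool with bulk release: nodes are handed out sequentially from
// a pre-sized array, and the whole pool is "freed" by resetting an index.
// Callers overwrite every field on allocation, so nothing needs clearing.
final class NodePool {
    private final PoolNode[] nodes;
    private int next;

    NodePool(int capacity) {
        nodes = new PoolNode[capacity];
        for (int i = 0; i < capacity; i++) {
            nodes[i] = new PoolNode();
        }
    }

    PoolNode allocate() {
        return nodes[next++];        // no per-object construction cost
    }

    void releaseAll() {
        next = 0;                    // bulk release: every node is reusable again
    }
}

final class PoolNode {
    double key;
    PoolNode parent, child, left, right;
}
[/code]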
[quote]Well there may be a better way to implement a fib heap in Java, but the complexity of the algorithm makes it more difficult to determine where the performance problem really lies.
[/quote]
Part of the benefit of implementing the fibheap in C++ is that the allocation of the fibheap nodes is no longer a cost. Unfortunately, pooling/recycling large numbers of small objects in Java typically slows down an application, whereas in C++ it can be used to effectively remove the cost of allocation and deallocation entirely.
God bless,
-Toby Reyelts
Yep all of the above is true
Don’t have much to add except to say the last 2 posters have it 100% correct
[quote]int largestDegree = ( int ) Math.floor( Math.log( size ) / LOG_PHI );
[/quote]
I don’t suppose there is any performance problem in this bit? I haven’t tested it; it’s just one of the items that caught my eye as worth checking.
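If it does show up in a profile, one cheap alternative might be an integer-based upper bound (an illustrative sketch, not a drop-in replacement):
[code]
final class DegreeBound {
    // log_phi(n) = log2(n) / log2(phi), and 1 / log2(phi) is roughly 1.4405,
    // so scaling floor(log2(n)) gives a safe over-estimate of the maximum
    // degree without calling Math.log/Math.floor in every consolidate.
    static int maxDegreeUpperBound(int size) {
        int floorLog2 = 0;
        for (int n = size; n > 1; n >>= 1) {    // floor(log2(size)), size >= 1
            floorLog2++;
        }
        return (int) ((floorLog2 + 1) * 1.4405) + 1;
    }
}
[/code]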