JAVA IS SLOW!

I wrote two versions of my code, one recycling node objects via a pool (quite easy to do) and the other just leaving it to the GC. The performance gain from using the pool was unmeasurable (smaller than the variation in run time) and came at the expense of extra lines of code and reduced clarity.
One problem with the benchmark is that the nodes live long enough to be copied out of eden, whereas in real applications they don’t last that long.
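
For reference, a minimal sketch of the kind of node pool I mean; the Node fields and the free-list type are illustrative, not the actual benchmark code:

import java.util.ArrayDeque;

// Illustrative node pool: recycle() puts nodes back on a free list and
// obtain() reuses them instead of allocating -- exactly the extra
// bookkeeping that bought no measurable speedup here.
final class NodePool {
    static final class Node {
        Object value;
        Node next;
    }

    private final ArrayDeque<Node> free = new ArrayDeque<Node>();

    Node obtain() {
        Node n = free.poll();
        return (n != null) ? n : new Node();
    }

    void recycle(Node n) {
        n.value = null;   // clear references so pooled nodes don't keep other objects alive
        n.next = null;
        free.push(n);
    }
}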

My copy of “Introduction to Algorithms” (CLR) has this to say about Fibonacci heaps after noting their use in the asymptotically fastest algorithms for paths on dense graphs.

[quote]From a practical point of view, however, the constant factors and programming complexity of Fibonacci heaps make them less desirable than ordinary binary (or k-ary) heaps for most applications. Thus Fibonacci heaps are predominantly of theoretical interest.
[/quote]

Changing the division to a multiplication by a precalculated reciprocal saves about 3%. Replacing the calculation with a really crude integer approximation saves around 10%. The algorithm works with any value at least as large as the one calculated, so the constant 28 will work for all heap sizes up to 1,000,000.
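
To make that concrete, here is a sketch, assuming the calculation in question is the upper bound on node degree (floor of log base phi of n) used when consolidating the heap; the class and method names are made up:

// Sketch of the degree-bound calculation, assuming the original used
// Math.log(n) divided by log(phi). Any result >= the exact bound is safe.
final class DegreeBound {
    private static final double LOG_PHI = Math.log((1.0 + Math.sqrt(5.0)) / 2.0);
    private static final double INV_LOG_PHI = 1.0 / LOG_PHI; // precalculated reciprocal

    // Original form: a division on every call.
    static int exact(int n) {
        return (int) Math.floor(Math.log(n) / LOG_PHI);
    }

    // ~3% saving: multiply by the precalculated reciprocal instead of dividing.
    static int viaReciprocal(int n) {
        return (int) Math.floor(Math.log(n) * INV_LOG_PHI);
    }

    // Crude integer over-approximation (~10% saving): log_phi(n) is about
    // 1.44 * log2(n), and bitLength(n) >= log2(n), so 1.5 * bitLength(n) is
    // always a safe over-estimate.
    static int crude(int n) {
        int bits = 0;
        for (int v = n; v != 0; v >>>= 1) {
            bits++;
        }
        return (bits * 3) / 2 + 1;
    }
}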

Toby,

I’d challenge you to write an app in C++ that has a large number of short-lived objects that can do better than HotSpot using manual memory management.

Eden allocation consists of a single pointer increment. Eden usage is tracked with some very clever data structures. If the eden contains only short-lived objects whose lives are over at the time of collection, deallocating ALL of them is again a single pointer change.
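
A toy illustration of what that looks like (this is not how HotSpot is actually implemented, it’s just to show why allocation and wholesale reclamation are each a single pointer update):

// Toy bump-pointer arena, purely for illustration.
final class BumpArena {
    private final byte[] space;
    private int top; // the "pointer" that moves

    BumpArena(int capacity) {
        space = new byte[capacity];
    }

    // Allocate: one pointer (index) increment.
    int allocate(int size) {
        if (top + size > space.length) {
            throw new OutOfMemoryError("arena full");
        }
        int offset = top;
        top += size;
        return offset;
    }

    // "Collect" everything that was allocated here: one pointer change.
    void reset() {
        top = 0;
    }
}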

Mark’s point, which is VERY legitimate, is that by pooling your objects you keep them alive. By keeping them alive you DEFEAT the eden and end up creating a lot MORE work for the GC in moving them out of the eden and into long-lived storage.

HotSpot GC is optimized for short-lived objects because
(1) if you write good object-oriented code then you WILL create a LOT of these short-lived objects.
(2) long-lived objects are, by definition, an issue much less often simply because of their life cycle.

Could you equal HotSpot’s allocation? Sure, by writing it over again; it’s all in C after all. But it’s very tricky, very highly evolved C code.

This is part of the problem with modern VMs. They are much more sophisticated than people realize, and thus people tend to imagine they are working in very different ways than they really are. This can lead you to bad conclusions if you are not very careful.

Oh and Mark’s dead on, though terse, about Order if I understand him. Order is a scalability measure. By definition, the larger the N, the more order OF N matters. The smaller the N, the more your constants affect the equation.

[quote]Oh and Mark’s dead on, though terse, about Order if I understand him. Order is a scalability measure. By definition, the larger the N, the more order OF N matters. The smaller the N, the more your constants affect the equation.
[/quote]
Then why doesn’t Sun create a performance white paper, post it on java.net, and put an end to the insanity once and for all? People complained that the PS2 was slow until Sony showed them how to make it fast. If people are making the wrong assumptions, then an expert (i.e. Sun) needs to tell them which assumptions are valid, which are invalid, and which are the worst-case things. The hardware manufacturers do it over and over again every generation, as do the video card manufacturers. I haven’t seen a modern performance white paper come out of Sun in almost two generations of the JVM.

Can’t blame people for doing things wrong if they’ve never been told the RIGHT way to do things. You can’t expect people to go and try to figure it out for themselves. If the optimal path is not obvious, expect people to do something not optimal.

That’s just my 0.02.

[quote]Toby,
I’d challenge you to write an app in C++ that has a large number of short-lived objects that can do better than HotSpot using manual memory management.

Eden usage is tracked with some very clever data structures. If the eden contains only short-lived objects whose lives are over at the time of collection, deallocating ALL of them is again a single pointer change.
[/quote]
Yep. I’m pretty much doing that right now. It costs a pointer increment to allocate and costs me nothing to deallocate. Of course, I don’t have to spend cycles tracking what the lifetimes of my objects are, because I just know what they are. I can also guarantee that this optimization occurs where I need it to occur, and HotSpot can’t seem to figure it out. Maybe my objects are too long-lived compared to Mark’s?

[quote]Mark’s point, which is VERY legitimate, is that by pooling your objects you keep them alive.
[/quote]
I don’t pool them in the Java code, but I do in my C++ code. That was my point. The objects aren’t allocated in a single function and then thrown away, like normal stack objects. They are pretty much around for the duration of the graph traversal, because they have to be. HotSpot probably isn’t smart enough to figure out how to optimize that allocation, though I can’t say for sure because I haven’t profiled the Java version of the app yet. I just know it runs really slowly compared to the C++ version.

[quote]Could you equal HotSpot’s allocation? Sure, by writing it over again; it’s all in C after all. But it’s very tricky, very highly evolved C code.
[/quote]
Well, I wrote the “very tricky, very highly evolved” C++ code; it’s about seventy lines including comments and took me about half an hour.

[quote]This is part of the problem with modern VMs. They are much more sophisticated than people realize, and thus people tend to imagine they are working in very different ways than they really are. This can lead you to bad conclusions if you are not very careful.
[/quote]
You’re totally misunderstanding what’s been said. Read my comments again. I specifically state that I’m not using pooling in Java code, because pooling works against the virtual machine (in the case of large numbers of small objects). I explicitly state that I use it in C++, because I can do it for free.

[quote]Oh and Mark’s dead on, though terse, about Order if I understand him. Order is a scalability measure. By definition, the larger the N, the more order OF N matters. The smaller the N, the more your constants affect the equation.
[/quote]
Again - you aren’t listening. Read my posts again. What do you think I’m saying when I talk about the difference between mergesort and quickmergeinsertsort? Believe me, when working with nearly 100 million objects, I’m quite aware of the importance of both constant factors and algorithmic complexities.

I don’t have the time to say anything more right now, but I’ll get to the rest (Mark’s posts included) later.

God bless,
-Toby Reyelts

I think we all missed the point that you weren’t pooling in Java. Apologies there.

As for the “tricky” code, sure, it becomes a lot easier if
(a) it’s specific to one data structure
and
(b) you never really have to worry about garbage.

The point about HS is that it gets the same kind of performance transparently on ALL short-lived objects, even ones that aren’t copious enough for a pool to make any sense, AND still allows for automated garbage collection.

I would say this: by the time you start building C pools, you are doing C optimization. This legitimizes any and all optimization of the Java code to Java’s quirks in return.

And in the end, I’ve always maintained that overall, today’s Java and C++ show equivalent execution performance for equivalently (not identically) optimized code. This includes choosing the proper algorithms for the platforms.

There are some tasks and algorithms that Java is naturally more suited to and will kick C++'s butt at (e.g. recursive stuff). There are some kinds of algorithms that C is naturally better at (e.g. complex access into arrays) and will always come out ahead on.

Overall, though, on complex problems Java today gives about the same results for a lot less work. I think that’s quite an accomplishment.

So in conclusion, your problem is you asked the wrong question. The right question is not “why the heck isn’t Java as fast as C on this algorithm designed for C” but instead “can someone produce the same results to the same or better degree of accuracy in the same time in Java?”

The second sounds like an interesting coding challenge for someone here with free cycles :)

Oh, and though I don’t have the time, I just feel compelled to do some bitching about the overhead of object headers. In C++ I can instantiate 70 million objects just fine. Try that with Java, and you’ll be hearing the whooshing sound of 560M of pure object header overhead flying by your head. No doubt the objects will be optimized for alignment, so you can be sure to hear hundreds more megabytes of space go whooshing by in the form of padding.

What’s the only technique to combat this? Allocate everything as freaking arrays:

float[] node_x; // x values
float[] node_y; // y values

int[] link_to_node; // index of to node
int[] link_from_node; // index of from node

int[] node_link_indices; // indices of the links for individual nodes
and so on…

Now you have to use some annoying flyweight pattern throughout your code. Plus you’ll be suffering array access penalties just about everywhere. Sigh.
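
Something like this, just to illustrate the flyweight shape I mean (the class is made up, but it reads over the parallel arrays above without allocating an object per node):

// Hypothetical flyweight: one reusable view over the parallel arrays, so the
// traversal code can read like it is using objects without paying one object
// header per node.
final class NodeFlyweight {
    private final float[] node_x;
    private final float[] node_y;
    private int index; // which node this flyweight currently points at

    NodeFlyweight(float[] node_x, float[] node_y) {
        this.node_x = node_x;
        this.node_y = node_y;
    }

    void moveTo(int nodeIndex) { index = nodeIndex; }

    float x() { return node_x[index]; }
    float y() { return node_y[index]; }
}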

The C++ code is about 10x more readable than the Java code. It’s rare that you’ll hear me make a statement like that.

On the other hand, I can happily say that I’m enjoying the new JNI direct byte buffer facilities in JDK1.4. That was a stroke of pure brilliance.

God bless,
-Toby Reyelts

I hear Shawn and Cas saying “Structures!” right about now…

Argghhh… must respond…

[quote]As for the “tricky” code, sure, it becomes a lot easier if
(a) it’s specific to one data structure
and
(b) you never really have to worry about garbage.
[/quote]
It’s actually for two data structures and would work for any others. (Welcome to the marvels of templates!)

[quote]The point about HS is that it gets the same kind of performance transparently on ALL short-lived objects, even ones that aren’t copious enough for a pool to make any sense, AND still allows for automated garbage collection.
[/quote]
On the other hand, stack allocation is trivial in C++, and in Java you just have to hope the VM is smart enough to handle it - or just plain squirm when it doesn’t.

[quote]I would say this: by the time you start building C pools, you are doing C optimization. This legitimizes any and all optimization of the Java code to Java’s quirks in return.
[/quote]
Absolutely. I started this post asking what could be done to make the Java Fibonacci heap fast.

[quote]And in the end, I’ve always maintained that overall, today’s Java and C++ show equivalent execution performance for equivalently (not identically) optimized code. This includes choosing the proper algorithms for the platforms.
[/quote]
Sometimes you can’t “choose” your algorithms, because the fastest known algorithms only have one basic implementation.

[quote]So in conclusion, your problem is you asked the wrong question. The right question is not “why the heck isn’t Java as fast as C on this algorithm designed for C” but instead “can someone produce the same results to the same or better degree of accuracy in the same time in Java?”
[/quote]
That Fibonacci heap was not designed for C. I wrote it in Java first and then ported it to C++. The question was, “how can you make this faster - at all?”

Now the question is really, “why is the entire application slower by orders of magnitude” (which, by the way, was also first written in Java, then ported to C++), “and what can be done to possibly make it faster?”.

Since I can’t just post thousands of lines of code (and people probably wouldn’t like that anyway), I’m going to have to resort to posting profiler results.

God bless,
-Toby Reyelts

Indeed; this is the same reason I’ve often requested someone (anyone, although Sun staff working on the VM would likely be a lot better at it than anyone else) do a regularly updated series on this (it was one of my suggestions for articles way back when…).

“regular” needs to be at least once per major Java version, and usually once per minor version too (lots of cases of performance “bugs” being fixed in minor version updates…)

[quote]Oh, and though I don’t have the time, I just feel compelled to do some bitching about the overhead of object headers. In C++ I can instantiate 70 million objects just fine. Try that with Java, and you’ll be hearing the whooshing sound of 560M of pure object header overhead flying by your head.
[/quote]
I. Feel. Your. Pain.

I did an academic project on evolving AI for computer games via Genetic Programming (which is like genetic algorithms, except that it evolves parse trees; this makes objects a much more natural data structure than packed arrays, which are trivial for GAs) - and made the twin mistakes of using Java and objects.

I had over 384MB RAM (this was about 4 years ago; maybe I had more), I stripped down the OS to about 40MB total virtual memory, and still had to allocate many hundreds of MB of additional swap space just to get a half-way decent evolution… Sigh. 50,000 parse trees of arbitrary complexity implemented with one object per node is not a clever idea in Java :), especially if you are iterating over all of them several times in a row; Java plus the OS also showed no signs of predictive paging on the iterations, which hurt too.

[quote]Absolutely. I started this post asking what could be done to make the Java Fibonacci heap fast.
[/quote]
Which is not QUITE the right question. I’m not entirely sure if the point here is a Fibonacci generator, which there are many approaches to, or a heap manager. You’ll please excuse me as I don’t have the cycles to dig in and figure the code out myself right now.

If the point is a heap allocator, then my question is “why?” at all, when Java does your memory management for you. If the point is a Fibonacci series as output, then the question is how to get that output as fast as this approach is in C, not how to make this particular approach fast in Java. Subtle but important difference.

I’d beg to differ in that you can always choose your data structures. However, I will also grant that there are certain micro-tasks that C, by nature of its insecurity, will be faster, if scarier, at. There is a reason why we still distribute a C MPEG decoder with JMF. Optimal C in that case is about 10% faster than optimal Java.

Then why does it only have a pool in C? I’m confused again, I’m afraid.

Ah, a good question. What are you using this for?
I’m assuming you already know it’s a hot spot, so what is the purpose in your code? It may be that a totally different solution to the problem is warranted.

[quote]Which is not QUITE the right question.
[/quote]
No, it is the right question. I’m writing a component that calculates shortest paths. If you read the literature, you’ll see that a Fibonacci heap is a type of priority queue which changes the algorithmic complexity of Dijkstra’s (the only known algorithm that is guaranteed to find the shortest path in a graph) from O(v^2) to O(e + v log v). This is the fastest known algorithm for finding the shortest paths in a generic network. There are other implementations of Dijkstra’s (aside from classic Dijkstra’s) that can be used based on the sparseness/denseness of the network, maximum path limits, and other specifics (i.e. a radix-heap based Dijkstra’s, Dial’s, etc.). We’ve investigated all of these and have our algorithm tweaked to work best with our particular problem and data set. The priority queue is critical, as the bulk of a path-finding algorithm is spent accessing the priority queue.
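
For anyone following along, here is roughly the shape of the algorithm. This sketch uses java.util.PriorityQueue with lazy deletion rather than a Fibonacci heap (so it does not get the O(e + v log v) bound that O(1) decrease-key gives), and the names are illustrative, not our actual code:

import java.util.Arrays;
import java.util.PriorityQueue;

// Minimal Dijkstra sketch over an adjacency list, just to show why the
// priority queue dominates the running time.
final class Dijkstra {
    static double[] shortestPaths(int numNodes, int[][] adj, double[][] weights, int source) {
        double[] dist = new double[numNodes];
        Arrays.fill(dist, Double.POSITIVE_INFINITY);
        dist[source] = 0.0;

        // Queue entries are {distance, node}; stale entries are skipped when polled.
        PriorityQueue<double[]> queue =
                new PriorityQueue<>((a, b) -> Double.compare(a[0], b[0]));
        queue.add(new double[] {0.0, source});

        while (!queue.isEmpty()) {
            double[] entry = queue.poll();        // queue access: the hot spot
            int node = (int) entry[1];
            if (entry[0] > dist[node]) continue;  // stale entry, node already settled

            for (int i = 0; i < adj[node].length; i++) {
                int next = adj[node][i];
                double candidate = dist[node] + weights[node][i];
                if (candidate < dist[next]) {
                    dist[next] = candidate;
                    queue.add(new double[] {candidate, next}); // "decrease-key" by re-insertion
                }
            }
        }
        return dist;
    }
}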

[quote]I’d beg to differ in that you can always choose your data structures.
[/quote]
Not if you want the ones that are the fastest. Need O(1) insert and O(1) lookup on a sparse dataset? What choices do you have?

[quote]Then why does it only have a pool in C? I’m confused again, I’m afraid.
[/quote]
Because in the port, I changed it from being “optimal in Java” to “optimal in C++”. While the Fibonacci heap was almost exactly a line-for-line port, the pathing component had a few minor changes. The addition of pooling and the use of real objects instead of groups of arrays were the largest changes I made.

[quote]Ah, a good question. What are you using this for? I’m assuming you already know it’s a hot spot, so what is the purpose in your code? It may be that a totally different solution to the problem is warranted.
[/quote]
I think I’ve already answered this.

God bless,
-Toby Reyelts

[quote]However, I will also grant that there are certain micro-tasks that C, by nature of its insecurity, will be faster, if scarier, at. There is a reason why we still distribute a C MPEG decoder with JMF. Optimal C in that case is about 10% faster than optimal Java.
[/quote]
This is a beef of mine. C/C++ is still the absolute wrong way to do an MPEG (or JPEG) decoder; it will be orders of magnitude slower than it should be. It is just a REALLY REALLY bad choice.

These are exactly the things that MMX/SSE2, and SIMD in general, are in the processor for.

Using generic ‘C’ codecs may get you a degree of portability… but the cost is HUGE on the performance side.

I suggested to some Sun guys many months ago that they use Intel’s (once free) JPEG library. I happen to have an app that really needs good JPEG decoding performance, and due to bugs in ImageIO that prevent decoding successive JPEG images into the same image buffer without leaking memory, plus the general non-workingness of JMF, I was forced to use the generic Toolkit image loading… which has pathetic performance.

Hmm. It may be you are working with far larger datasets than I’m used to. My first inclination for sparse data set insertion/lookup is a hashtable. My second thought is a B+ tree, but that’s because any time I’ve worked with such a large data set it’s been disk-based.

My third thought is a commercial in-memory database product like TimesTen, and let someone else solve the problem, but that’s because I have that option.

If you’ve profiled, what is it in the algorithm that’s slowing you down? Are you having to implement a heap on an array? I could see access there being expensive. There might be caching strategies depending on the expected pattern of access…

What’s the profiler telling you about the sticking points?

http://java.sun.com/j2se/1.4.2/1.4.2_whitepaper.html

http://java.sun.com/docs/performance/

http://www.javaperformancetuning.com/

[quote]Oh, and though I don’t have the time, I just feel compelled to do some bitching about the overhead of object headers. In C++ I can instantiate 70 million objects just fine. Try that with Java, and you’ll be hearing the whooshing sound of 560M of pure object header overhead flying by your head.
[/quote]
I know exactly what you mean. I did notice that Blackdown have an RC1 version of 1.4.2 for AMD64 on Linux available (and it uses HotSpot). Of course, the object headers would then be even bigger, but at least the address space limit would be much further out.
My own partial solution to this is to reduce the number of objects, but not by going all the way to arrays. Classically, to represent a road junction where U-turns are prohibited, you end up with lots of vertices and arcs — many more than appear as visible objects on the ground. I combine these to give a single node object with more complex behaviour, plus link objects, each of which represents two (or more) arcs in the classical representation. The end result is that, despite the object headers, the typical data usage is less than the previous C++ implementation.
By the way, would you mind saying what use you are making of the path calculations? Also, how many of the intermediate vertex labels are needed once the path calculation is complete? I replace the labels which aren’t wanted at the end with a reference to a single dummy label (with -infinite cost) as soon as the vertex has been scanned and is thus no longer needed. This helps keep the lifetimes of most such objects low.
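
In case it helps, this is roughly what I mean, as a sketch; the names are invented for illustration, not taken from my actual code:

// Illustrative only: once a vertex has been scanned, its real label is swapped
// for a shared dummy (with -infinite cost), so the label object can die while
// it is still young.
final class LabelStore {
    static final class Label {
        final double cost;
        Label(double cost) { this.cost = cost; }
    }

    private static final Label DUMMY = new Label(Double.NEGATIVE_INFINITY);

    private final Label[] labels;

    LabelStore(int vertexCount) {
        labels = new Label[vertexCount];
    }

    void set(int vertex, Label label) { labels[vertex] = label; }

    Label get(int vertex) { return labels[vertex]; }

    // Called as soon as the vertex has been scanned and its label is no longer needed.
    void discard(int vertex) { labels[vertex] = DUMMY; }
}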

I gather that your problem is dense. Sometimes it is possible to temporarily discard most of the arcs and solve the remaining sparse problem. If you can then prove that the result is optimal for the original problem, you are done; otherwise you have to add back a few more arcs and re-solve. In the case I investigated in my PhD thesis, doing all this was still faster than solving the original dense problem.

The advantage I find with Java is that I can get a lot more algorithm experiments done in the time available and thus am more likely to end up with a better algorithm than if I was still working in C++. Even an area as well studied as shortest path algorithms still has room for improved algorithms in special cases, especially if you have the freedom to slightly rephrase the question.

Perhaps I should be more specific. A white paper that contains code that is performant and code that isn’t, and then explains why in terms of how the JVM works. There are a wide variety of myths with respect to Java performance out there that just won’t die.

[quote]There are a wide variety of myths with respect to Java performance out there that just won’t die.
[/quote]
Unfortunately, there are people who continue to propagate some of these myths regardless of any evidence you offer them. Apart from these trolls, there are many others who inexplicably don’t even bother to look in the obvious place for information despite it being free.
So while more information on how to write effective code is always welcome, don’t expect it to have much impact on the propagation of myths. These myths are harder to kill than the demons in any game.