JAVA IS SLOW!

Oh, and though I don’t have the time, I just feel compelled to do some bitching about the overhead of object headers. In C++ I can instantiate 70 million objects just fine. Try that with Java, and you’ll be hearing the whooshing sound of 560M of pure object header overhead flying by your head. No doubt the objects will be optimized for alignment, so you can be sure to hear hundreds more megabytes of space go whooshing by in the form of padding.

What’s the only technique to combat this? Allocate everything as freaking arrays:

float[] node_x; // x values
float[] node_y; // y values

int[] link_to_node; // index of to node
int[] link_from_node; // index of from node

int[] node_link_indices; // indices of the links for individual nodes

and so on…

Now you have to use some annoying flyweight pattern all throughout your code. Plus you’ll be suffering array access penalties just about everywhere. sigh
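The flyweight-over-arrays idiom being complained about here looks roughly like this (a minimal sketch; the Graph and NodeCursor names are mine, not from any actual code in this thread):

```java
// Hedged sketch of the flyweight pattern over parallel arrays.
// Graph and NodeCursor are illustrative names, not real code from this thread.
class Graph {
    final float[] nodeX;
    final float[] nodeY;

    Graph(int nodeCount) {
        nodeX = new float[nodeCount];
        nodeY = new float[nodeCount];
    }

    // One reusable flyweight "node" that is really just an index into the arrays.
    static final class NodeCursor {
        private final Graph g;
        private int i;

        NodeCursor(Graph g) { this.g = g; }

        void moveTo(int index) { i = index; }
        float x() { return g.nodeX[i]; }
        float y() { return g.nodeY[i]; }
        void setX(float x) { g.nodeX[i] = x; }
        void setY(float y) { g.nodeY[i] = y; }
    }
}
```

One cursor instance gets reused for every node in the graph, so the per-object header cost vanishes; the price is exactly the pervasive array indexing griped about above.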

The C++ code is about 10x more readable than the Java code. It’s rare that you’ll hear me make a statement like that.

On the other hand, I can happily say that I’m enjoying the new JNI direct byte buffer facilities in JDK1.4. That was a stroke of pure brilliance.
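For anyone who hasn’t tried them: a direct buffer lives outside the Java heap, and native code can reach the same memory via JNI’s GetDirectBufferAddress with no copying. A minimal sketch (the method name is illustrative):

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.FloatBuffer;

class DirectBufferDemo {
    // Allocates memory outside the Java heap. On the native side, JNI's
    // GetDirectBufferAddress returns a pointer to this same memory,
    // so data moves between Java and C++ without copying.
    static FloatBuffer makeVertexBuffer(int floats) {
        return ByteBuffer.allocateDirect(floats * 4)
                         .order(ByteOrder.nativeOrder())
                         .asFloatBuffer();
    }
}
```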

God bless,
-Toby Reyelts

I hear Shawn and Cas saying “Structures!” right about now…

Argghhh… must respond…

[quote]As for the “tricky” code, sure it becomes a lot easier if
(a) it’s specific to one data structure
and
(b) you never really have to worry about garbage.
[/quote]
It’s actually for two data structures and would work for any others. (Welcome to the marvels of templates!)

[quote]The point about HS is that it gets the same kind of performance transparently on ALL short lived objects, even ones that aren’t copious enough to make a pool make any sense, AND still allows for automated garbage collection.
[/quote]
On the other hand, stack allocation is trivial in C++, and in Java you just have to hope the VM is smart enough to handle it - or just plain squirm when it doesn’t.

[quote]I would say this, by the time you start building C pools you are doing C optimization. This legitimizes any and all optimization of the Java code to Java’s quirks in return.
[/quote]
Absolutely. I started this post asking what could be done to make the Java Fibonacci heap fast.

[quote]And in the end, I’ve always maintained that over-all todays Java and C++ show equivalent execution performance for equivalently (not identically) optimized code. This includes choosing the proper algorithms for the platforms.
[/quote]
Sometimes you can’t “choose” your algorithms, because the fastest known algorithms only have one basic implementation.

[quote]So in conclusion your problem is you asked the wrong question. The right question is not “why the heck isn’t Java as fast as C on this algorithm designed for C” but instead “can someone produce the same results to the same or better degree of accuracy in the same time in Java?”
[/quote]
That Fibonacci heap was not designed for C. I wrote it in Java first and then ported it to C++. The question was, “how can you make this faster - at all?”

Now the question is really, “why is the entire application slower by orders of magnitude” (which, by the way, was also first written in Java, then ported to C++), “and what can be done to possibly make it faster?”.

Since I can’t just post thousands of lines of code (and people probably wouldn’t like that anyway), I’m going to have to resort to posting profiler results.

God bless,
-Toby Reyelts

Indeed; this is the same reason I’ve often requested someone (anyone, although Sun staff working on the VM would likely be a lot better at it than anyone else) do a regularly updated series on this (it was one of my suggestions for articles way back when…).

“regular” needs to be at least once per major Java version, and usually once per minor version too (lots of cases of performance “bugs” being fixed in minor version updates…)

[quote]Oh, and though I don’t have the time, I just feel compelled to do some bitching about the overhead of object headers. In C++ I can instantiate 70 million objects just fine. Try that with Java, and you’ll be hearing the whooshing sound of 560M of pure object header overhead flying by your head.
[/quote]
I. Feel. Your. Pain.

I did an academic project on evolving AI for computer games, via Genetic Programming (which is like GAlgos, except that it evolves parse-trees; this makes objects a much more natural data structure than packed arrays (which are trivial for GA)) - and made the twin mistakes of using Java and objects.

I had over 384MB RAM (this was about 4 years ago; maybe I had more), I stripped down the OS to about 40 MB total virtual memory, and still had to allocate many hundreds of MB of additional swap space just to get a half-way decent evolution… Sigh. 50,000 parse-trees of arbitrary complexity implemented with one object per node is not a clever idea in Java :), especially if you are iterating over all of them several times in a row; Java + OS also showed no signs of predictive paging on the iterations, which hurt too.

[quote]Absolutely. I started this post asking what could be done to make the Java Fibonacci heap fast.
[/quote]
Which is not QUITE the right question. I’m not entirely sure if the point here is a Fibonacci generator, which there are many approaches to, or a heap manager. You’ll please excuse me, as I don’t have the cycles to dig in and figure the code out myself right now.

If the point is a heap allocator, then my question is “why?” at all, when Java does your memory management for you. If the point is a Fibonacci series as output, then the question is how to get that output as fast as this approach is in C, not how to make this particular approach fast in Java. Subtle but important difference.

I’d beg to differ in that you can always choose your data structures. However, I will also grant that there are certain micro-tasks that C, by nature of its insecurity, will be faster, if scarier, at. There is a reason why we still distribute a C MPEG decoder with JMF. Optimal C in that case is about 10% faster than optimal Java.

Then why does it only have a pool in C? I’m confused again I’m afraid.

Ah, a good question. What are you using this for?
I’m assuming you already know it’s a hot-spot, so what is the purpose in your code? It may be that a totally different solution to the problem is warranted.

[quote]Which is not QUITE the right question.
[/quote]
No, it is the right question. I’m writing a component that calculates shortest paths. If you read the literature, you’ll see that a Fibonacci heap is a type of priority queue which changes the algorithmic complexity of Dijkstra’s (the only known algorithm that is guaranteed to find the shortest path in a graph) from O( v^2 ) to O( e + v log v ). This is the fastest known algorithm for finding the shortest paths in a generic network. There are other implementations of Dijkstra’s (aside from classic Dijkstra’s) that can be used based on the sparseness/denseness of the network, maximum path limits, and other specifics (i.e. a radix-heap based Dijkstra’s, Dial’s, etc…). We’ve investigated all of these and have our algorithm tweaked to work best with our particular problem and data set. The priority queue is critical, as the bulk of a path-finding algorithm is spent accessing the priority queue.
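For readers following along, the priority-queue-driven shape of Dijkstra’s looks roughly like this, using java.util.PriorityQueue (a binary heap, with lazy deletion instead of decrease-key) as a stand-in for the Fibonacci heap being discussed; all names are illustrative, not Toby’s actual code:

```java
import java.util.Arrays;
import java.util.PriorityQueue;

// Hedged sketch of priority-queue-based Dijkstra. java.util.PriorityQueue
// is a binary heap (O(log v) operations), standing in for the Fibonacci
// heap; stale queue entries are skipped rather than decrease-keyed.
class Dijkstra {
    // adj[v] holds v's outgoing edges flattened as n0,w0,n1,w1,...
    static int[] shortestDistances(int[][] adj, int source) {
        int[] dist = new int[adj.length];
        Arrays.fill(dist, Integer.MAX_VALUE);
        dist[source] = 0;
        // Entries are {distance, vertex}, ordered by distance.
        PriorityQueue<int[]> pq =
            new PriorityQueue<>((a, b) -> Integer.compare(a[0], b[0]));
        pq.add(new int[] {0, source});
        while (!pq.isEmpty()) {
            int[] top = pq.poll();
            int d = top[0], v = top[1];
            if (d > dist[v]) continue; // stale entry; skip it
            for (int i = 0; i < adj[v].length; i += 2) {
                int u = adj[v][i], w = adj[v][i + 1];
                if (d + w < dist[u]) {
                    dist[u] = d + w;
                    pq.add(new int[] {dist[u], u});
                }
            }
        }
        return dist;
    }
}
```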

[quote]I’d beg to differ in that you can always choose your data structures.
[/quote]
Not if you want the ones that are the fastest. Need O( 1 ) insert and O( 1 ) lookup on a sparse dataset? What choices do you have?

[quote]Then why does it only have a pool in C? I’m confused again I’m afraid.
[/quote]
Because in the port, I changed it from being “optimal in Java” to “optimal in C++”. While the fibonacci heap was almost exactly a line-for-line port, the pathing component had a few minor changes. The addition of pooling and the use of real objects instead of groups of arrays were the largest changes I made.

[quote]Ah, a good question. What are you using this for? I’m assuming you already know it’s a hot-spot, so what is the purpose in your code? It may be that a totally different solution to the problem is warranted.
[/quote]
I think I’ve already answered this.

God bless,
-Toby Reyelts

[quote]However I will also grant that there are certain micro-tasks that C by nature of its insecurity will be faster, if scarier, at. There is a reason why we still distribute a C MPEG decoder with JMF. Optimal C in that case is about 10% faster than optimal Java.
[/quote]
This is a beef of mine. C/C++ is still the absolute wrong way to do an MPEG (or JPEG) decoder; it will be orders of magnitude slower than it should be. It is just a REALLY REALLY bad choice.

These are the exact things that MMX/SSE2 - SIMD in general are in the processor for.

Using generic ‘C’ codecs may get you a degree of portability… but the cost is HUGE on the performance side.

I suggested to some Sun guys many months ago to use Intel’s (once free) JPEG library. I happen to have an app that really needs good performance for decoding JPEG images, and due to bugs in ImageIO that don’t allow me to decode successive JPEG images into the same image buffer without leaking memory, plus the general non-workingness of JMF, I was forced to use the general Toolkit image loading… which has pathetic performance.

Hmm. It may be you are working with far larger datasets than I’m used to. My first inclination for sparse data set insertion/lookup is a hash table. My second thought is a B+ tree, but that’s because any time I’ve worked with such a large data set, it’s been disk-based.

My third thought is a commercial in-memory database product like TimesTen, and let someone else solve the problem, but that’s because I have that option.
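The hash-table suggestion, sketched with plain java.util (amortized O(1) insert and lookup, though each entry is itself an object, which feeds straight back into the header complaint earlier in the thread; the SparseStore name is mine):

```java
import java.util.HashMap;
import java.util.Map;

// Sparse mapping from large, scattered node ids to per-node data, with
// amortized O(1) put/get. Note that each entry is an object with its own
// header, so the overhead complained about above applies per key.
class SparseStore {
    private final Map<Integer, Float> costs = new HashMap<>();

    void put(int nodeId, float cost) { costs.put(nodeId, cost); }

    // Missing ids read as "unreached" rather than throwing.
    float get(int nodeId) {
        Float c = costs.get(nodeId);
        return c == null ? Float.POSITIVE_INFINITY : c;
    }
}
```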

If you’ve profiled, what is it in the algorithm that’s slowing you down? Are you having to implement a heap on an array? I could see access there being expensive. There might be caching strategies depending on the expected pattern of access…

What’s the profiler telling you about the sticking points?

http://java.sun.com/j2se/1.4.2/1.4.2_whitepaper.html

http://java.sun.com/docs/performance/

http://www.javaperformancetuning.com/

[quote]Oh, and though I don’t have the time, I just feel compelled to do some bitching about the overhead of object headers. In C++ I can instantiate 70 million objects just fine. Try that with Java, and you’ll be hearing the whooshing sound of 560M of pure object header overhead flying by your head.
[/quote]
I know exactly what you mean. I did notice that Blackdown have an RC1 version of 1.4.2 for the AMD64 on Linux available (and it uses HotSpot). Of course the object headers would then be even bigger but at least the address space limit would be much further out.
My own partial solution to this is to reduce the number of objects, but not by going all the way to arrays. Classically, to represent a road junction where U-turns are prohibited, you end up with lots of vertices and arcs — many more than appear as visible objects on the ground. I combine these to give a single node object with more complex behaviours, plus link objects which each represent two (or more) arcs in the classical representation. The end result is that despite the object headers, the typical data usage is less than the previous C++ implementation.
By the way would you mind saying what use you are making of the path calculations? Also how many of the intermediate vertex labels are needed once the path calculation is complete? I replace the labels which aren’t wanted at the end by a reference to a single dummy label (with -infinite cost) as soon as the vertex has been scanned and thus no longer needed. This helps keep the lifetimes of most such objects low.

I gather that your problem is dense. Sometimes it is possible to temporarily discard most of the arcs and solve the remaining sparse problem. If you can then prove that the result is optimal for the original problem, you are done, otherwise you have to add back a few more arcs and resolve. In the case I investigated in my PhD thesis doing all this was still faster than solving the original dense problem.

The advantage I find with Java is that I can get a lot more algorithm experiments done in the time available and thus am more likely to end up with a better algorithm than if I was still working in C++. Even an area as well studied as shortest path algorithms still has room for improved algorithms in special cases, especially if you have the freedom to slightly rephrase the question.

Perhaps I should be more specific: a white paper that contains code that is performant and code that isn’t, and then explains why, as a function of how the JVM works. There are a wide variety of myths with respect to Java performance out there that just won’t die.

[quote]There are a wide variety of myths with respect to Java performance out there that just won’t die.
[/quote]
Unfortunately there are people who continue to propagate some of these myths regardless of any evidence you offer them. Apart from these trolls, there are many others who inexplicably don’t even bother to look in the obvious place for information, despite it being free.
So while more information on how to write effective code is always welcome, don’t expect it to have much impact on the propagation of myths. These myths are harder to kill than the demons in any game.

Yeah, but once we get past the trolls and the lazy - we get to people who genuinely want to understand.

For example - if I create a ByteBuffer and pack it with ints, and my architecture does specific things with byte alignment, what does the VM do? Is there any way to influence this behavior? How do I notice this behavior? Same with things like images, textures, image formats (I know OSX is quite picky with these), etc.

How are objects created and destroyed with autoboxing - what’s going on beneath the covers in the virtual machine?

When I make a call to a JNI function that creates objects, how is this different from if they were created in my native Java code?

If I want to create lots of short-lived objects yet have a large eden space (because maybe I do lots of operations that require object instantiation for some reason), why are the pause times growing, since those objects should pretty much be going away immediately?

Is there a way to organize collections such that the objects you put in collections can sit in the young generation longer, even though the collection itself may be tenured?

How do I configure all of these performance options in Web Start or in a Java applet?

I could sit all day with someone while trying to do performance tuning and ask a million questions even though I’ve read through the available materials.

Nothing to my knowledge. The VM lays down data byte by byte according to the Java spec. You need to format the data yourself if you are handing it to hardware that needs a given format.
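In other words, the buffer writes whatever byte order you tell it (big-endian by default), and you pick the order explicitly when the hardware cares. A sketch (the class and method names are mine):

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

class ByteOrderDemo {
    // Pack an int for hardware that expects little-endian layout.
    // The VM won't reorder anything for you; the order is set explicitly.
    static byte[] packLittleEndian(int value) {
        ByteBuffer bb = ByteBuffer.allocate(4).order(ByteOrder.LITTLE_ENDIAN);
        bb.putInt(value);
        return bb.array();
    }
}
```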

Nothing. This is purely a compiler thing. Syntactic sugar.

It’s the same code underneath. The only issues are those of locking, as with any JNI manipulation of Java objects.

Got me. I’d need to see some numbers and the test, and we could run it by the VM guys.

The young generation gets cleaned up when the eden fills. You can increase the eden size with command line parameters. The flip side is that the bigger the eden, the more it may potentially have to move when an eden clean DOES happen.

The VM itself also adjusts the eden size on the fly, trying to maintain a good balance. I’m not sure if there are tweakable knobs for that, like there are in the -XX flags, but their meanings are likely not intuitively obvious anyway. Again, I can ask the VM guys for more info if you want.
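For reference, the knobs mentioned above look something like this on a 1.4-era HotSpot (flag behaviour varies by VM version, and MyGame is a placeholder class name):

```shell
# Fix the young generation (eden + survivor spaces) at 64 MB,
# and print collection details to see what the GC is actually doing.
java -Xmn64m -verbose:gc -XX:+PrintGCDetails MyGame
```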

Can’t yet. They are working on adding command line flag support to Web Start. Applets really have no place to add that :frowning:

Well, ask them here and I’ll try to find answers where I can. Also, if you can get to JavaOne, the BOF sessions are great for questions like this, because you get direct access to the developers.

In particular, we are going to have a game BOF at JavaOne. If you guys would like to give me a list of the groups you’d like to see there to answer questions, I can try to twist some arms into showing up :slight_smile:

I will most likely make it to JavaOne this year - as soon as Sun puts up some information on pricing I will get someone to put in the request… EARLY.

Now here is the interesting question I had. According to the papers, garbage collection ONLY happens when a generation fills up (when dealing with the generational collectors). So what stops me from just creating one gigantic eden space of, say, 32MB? Yes, that space will suffer from heap fragmentation over time, but if I know that I can fit all of my data for a level into 32MB, I can just force the collection at the end of the level, or when a player dies, using System.gc() - this seems to be ideal for a game and shouldn’t require any GC at all. After all, that’s what we used to do on consoles: pack it all in and save some temporary heap space for some dynamic things. Then at the end of the level we could just dump it all.

On a console this would have been more problematic, but on the PC where I know I have at least 64MB available - I should be able to just create one big-ass eden space where all of my objects are likely to fit and garbage collection should never happen - right? That’s more in tune with the garbage collection strategy that a game or other multimedia application is likely to need.

[quote]I will most likely make it to JavaOne this year - as soon as Sun puts up some information on pricing I will get someone to put in the request… EARLY.

Now here is the interesting question I had. According to the papers, garbage collection ONLY happens when a generation fills up (when dealing with the generational collectors).
[/quote]
AIUI that’s not quite true. It’s true for the eden; it’s NOT true for the trains, which are incrementally collected. That’s why it’s called an “incremental gc” when you turn them on :slight_smile:

Nope you want exactly the opposite in that situation. I explained this in the other thread on this.

Hmmm… okay. I guess I didn’t understand the paper as well as I thought. Since I want my objects to reach tenure faster - what’s the best way to do this? The reason I was thinking the opposite is that I want the gc to run as infrequently as possible. Since the gc would only run when eden was full, it seemed to make sense to just increase the size of eden and let my objects sit there. If the gc is only going to run when it’s full, I would have expected that it would just be looking at the number of bytes allocated versus the eden size to know whether or not it needed to do anything, but clearly it’s doing ‘something’ and I’m not sure what.

Essentially what I want to do is get my objects into a position where the VM isn’t really wasting any time to collect them, because quite simply I know they don’t need to be collected. The issue with the gc in these scenarios is that it is counter to all of the gc algorithms. I understand my memory needs better than the gc, and there is no way to tell the gc, “you don’t need to run every n milliseconds, because you’re just wasting your time and wrecking my performance”.

Now that I’ve described basically what I’m after (I want to preallocate 80-90% of the objects in a level that I KNOW I’m going to need), what’s the best gc strategy for that, since it doesn’t seem to be one that’s covered? All of the existing gc strategies seem to assume a memory graph that’s the exact opposite of what I need. I don’t really have a lot of dynamic memory needs - I don’t need to wait for my objects to take the train to tenure - I don’t need the gc to even run very often, as just about everything that I put into memory will be staying there. I’m not planning on creating lots of temporary objects, and I apparently don’t need to worry about compacting the heap - so I just need a way to get the gc to work with a memory graph that the current VM strategies don’t seem to cover at all.

You want a very small Eden and an incremental collector, and at opportune moments, to give the GC a hint when to collect with a System.gc() - at the end of the level.

Cas :slight_smile:
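Concretely, that advice might look something like this at the level boundary (a hedged sketch; LevelManager is an illustrative name, and on a 1.4-era VM you’d pair it with flags along the lines of -Xmn2m -Xincgc):

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: drop references and hint a collection at a moment
// where a pause doesn't hurt, rather than letting one land mid-frame.
class LevelManager {
    final List<Object> levelData = new ArrayList<>();

    void endLevel() {
        levelData.clear(); // drop references so the objects are collectable
        System.gc();       // hint: spend the GC time now, between levels
    }
}
```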

Thanks, forgot to mention System.gc().

In most cases today using System.gc() is harmful, but this is exactly the kind of situation it is intended for: when you want to tell the VM, “I’d really like you to spend some significant time gc-ing right now!”