Carmack did it again

[quote=http://www.gamasutra.com/php-bin/news_index.php?story=14979]And as far as Java is concerned, he called it a “good attempt at making something run at a tenth of the speed it should.”
[/quote]
Of course he was talking about Java on mobile phones, at QuakeCon 2007…

I know this sounds weird, but he’s not that far off though…

Because of the lack of SIMD support, we run up to a factor of 4 slower for vector math on the last generation of x86 CPUs.
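
To make that factor concrete, here is a minimal sketch (the Vec4 class and its add method are my own illustration, not an API mentioned in this thread) of the kind of code that is affected: with SSE such an operation maps to a single packed instruction, while a JIT without SIMD support emits four scalar float additions.

```java
// Hypothetical 4-component vector type, purely for illustration.
public final class Vec4 {
    public float x, y, z, w;

    public Vec4(float x, float y, float z, float w) {
        this.x = x; this.y = y; this.z = z; this.w = w;
    }

    /** dest = a + b, computed component by component.
        With SSE this is a single ADDPS; without SIMD support the JIT
        emits four separate scalar additions - hence "factor 4". */
    public static void add(Vec4 dest, Vec4 a, Vec4 b) {
        dest.x = a.x + b.x;
        dest.y = a.y + b.y;
        dest.z = a.z + b.z;
        dest.w = a.w + b.w;
    }
}
```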

My RFE got closed, and I didn’t notice any major boost in 6u2, which according to this other RFE should have happened. (Right?)

I see what they're saying regarding possible architecture differences, but SIMD is important for multimedia applications. And when the pipeline is used well you get even greater gains: alpha blending, for example, is much more than 4 times faster.
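
For illustration, a hedged sketch of the kind of scalar blend loop in question (the pixel format and method below are assumptions made for the example, not code from the thread): each channel is unpacked, blended and repacked one at a time, whereas a SIMD version can process several channels, or several pixels, per instruction.

```java
// Hypothetical scalar alpha blend over 32-bit ARGB pixels.
static void blend(int[] dst, int[] src) {
    for (int i = 0; i < src.length; i++) {
        int s = src[i];
        int d = dst[i];
        int a = (s >>> 24) & 0xFF; // source alpha, 0..255
        int r = (((s >>> 16) & 0xFF) * a + ((d >>> 16) & 0xFF) * (255 - a)) / 255;
        int g = (((s >>>  8) & 0xFF) * a + ((d >>>  8) & 0xFF) * (255 - a)) / 255;
        int b = (( s         & 0xFF) * a + ( d         & 0xFF) * (255 - a)) / 255;
        dst[i] = 0xFF000000 | (r << 16) | (g << 8) | b;
    }
}
```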

good point Riven,

but do you think more than 50% of the CPU-consuming code in a game uses SIMD? Otherwise the factor should shrink…

Anyway, I would also prefer HotSpot using SIMD instructions wherever possible/useful instead of having an API. The reasons are:

  • I’m lazy: even if a dedicated method only gets a 2x speed-up where the API could have given me 4x, I will likely also get improvements in other places where I would never have used SIMD calls explicitly…
  • Such an API would probably change frequently, which is why it is not a good idea to put it in the core library.*

*Game developers usually don’t care, because once a game is done most changes are bug fixes, so one can stick with the old version (like with DirectX) - that’s why IMHO a temporary com.sun.xxx API would be great.

Regarding the “other RFE”:

I remember asking about the details on a java.net hosted blog (sorry, don’t remember which one):

IIRC it was a documentation bug; more precisely: one of the JVM developers identified one or a few circumstances where SIMD can be used and did an implementation. Afterwards, he/she filed the RFE to document his/her work. This is a step in the right direction, but still far away from trying to use SIMD instructions wherever possible - which is what the title may suggest.

Carmack’s mainly talking about array access - instead of using MIDP’s image rendering routines, a 3D game on MIDP 2 (without OpenGL, etc.) needs to draw each pixel one at a time by setting a value in an array, and then copy that array to display memory. There are two slow things here:

  1. A null check and an array bounds check every time you set a value in an array. Slow! HotSpot can sometimes eliminate those checks, but the VMs on these mobile devices aren’t that great.
  2. For MIDP 2, first you’ve got to render to a 24-bit color array, and then call the drawRGB() method, which converts the array to whatever color format the device uses - usually 16-bit color. Color conversion? Slow!

Compared to the original DOOM, which wrote pixel data directly to VGA memory, the MIDP way of doing things is, indeed, slow.
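
Here’s a hedged sketch of that rendering path on a plain MIDP 2 Canvas (renderScene is a placeholder for the game’s own rasterizer, not anything from the thread), showing where both costs show up:

```java
import javax.microedition.lcdui.Canvas;
import javax.microedition.lcdui.Graphics;

public class SoftwareCanvas extends Canvas {
    private int[] frame; // 32-bit ARGB back buffer

    protected void paint(Graphics g) {
        int w = getWidth(), h = getHeight();
        if (frame == null || frame.length != w * h) {
            frame = new int[w * h];
        }

        // 1. Every frame[...] write pays a null check and a bounds check,
        //    unless the device's JIT manages to eliminate them.
        for (int i = 0; i < frame.length; i++) {
            frame[i] = 0xFF000000; // clear to opaque black
        }
        renderScene(frame, w, h); // plot the scene pixel by pixel

        // 2. drawRGB() converts the 32-bit array to the device's native
        //    pixel format (often 16-bit) - a per-pixel conversion per frame.
        g.drawRGB(frame, 0, w, 0, 0, w, h, false);
    }

    private void renderScene(int[] pixels, int w, int h) {
        // placeholder for the per-pixel 3D rasterizer
    }
}
```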

Speaking in general and not just for mobile games, it would be much more useful if the Java language supported vectors and matrices as primitive data types, but we know that will never happen.

In the long run (possibly the next 5-10 years) it will probably become obsolete for most of its uses, once processors are more like the Cell - with multiple SIMD processors. But for the current situation it is an issue!

I think the gap will only widen, as there is no support for such technology in the generated bytecode.

The future is wrapping native libraries, I think.

For the gap to keep widening, the programming complexity of applications would have to increase so that the time spent in logic keeps pace with the increase in processing power of the hardware. I suspect that in the next 10 or so years our programming tasks will involve more processing of data with less programming logic. The hardware is going that way, so maybe the programming approach will follow too.

But in the meantime, until that change happens, I do agree with wrapping native libraries, especially since the above is just a possibility/idea.

Why would this (for example SIMD) support be needed in the generated bytecode in order to keep up, and not in the JIT? I’d say it’s up to the JIT to use whatever technology is present in the hw…

I’ve read some EA documents stating how auto-vectorization completely failed performance-wise - they invested six months in it, with a group of experts in the field. Even C/C++ compilers fail to get the most out of SSE2 instructions; compilers generating code within 80% of optimal was considered a utopia.

Seeing how long it took for the HotSpot JIT to reach its current level of performance, and how fragile it still is - you end up hand-tuning source code by trial and error to optimize for a specific JIT (to squeeze out the last… 200% :o, should we call it the SweetSpot JIT ;)) - I’m fairly sure auto-vectorization will never ever be able to compete with hand-written instructions.

We’re moving to stream processing anyway, with kernels processing the streams.

Probably, in the future, we won’t care about SSE2 performance at all, as everything related to vector math will be executed on specialized chips anyway (GPU-like cores), as keldon85 mentions.

six months at ea doesn’t impress me tbh.

Feature X will probably never be able to compete with hand-written instructions - where have I heard that before?

It still doesn’t :slight_smile: Of course the number of places where that matters is shrinking due to complexity. The time-critical tasks of today are no more complex than they were 20 years ago, so the speed benefit offered by lower-level optimisations drops every couple of years. That is why I talk of the long term. I mean, I’ve spent some time coding for the GBA, and you don’t even want to touch garbage collection, generalization or dynamic type casting because they shatter your speed. One assembler written in assembly compiles much quicker than anything written in C, and its search speeds trample on everything the commercial (C) world has to offer - not because the author optimized heavily, but because he was able to make better decisions.

Also the size of programs has grown too large for a complete assembler solution to be viable; the same can be said for C, and maybe soon C++!

It’s like scripting languages: you just wouldn’t code your ray caster in one, but game logic is fine!

Getting off the high horse, we could go the “Mono way”: SIMD support in Mono

I like their approach of bytecode-to-GL(SL) conversion with attributes (annotations).

[edit] ASM looks like a cool library for doing this in Java [/edit]

I believe there was a java.net project that recognised certain patterns (you could also help it along) and optimised the hell out of them. I was gonna fiddle with it, but I can’t seem to find the bookmark :confused:

Hmmm, how could that library help? From what I understand it compiles Java source to bytecode? For SIMD or other low-level stuff the JVM would need to support it - or do you mean redirecting specific calls to a JNI library?

Sounds interesting! Maybe you can remember a name or other reference? Would like to try it out too!

Yes, redirecting can be done, if one only uses certain instructions/library calls.

But wouldn’t the JNI calls add too much overhead for just vector operations to be worthwhile? Or are there any projects already taking this approach?

You are absolutely right. To replace single instructions by JNI calls would be many times slower.

Either the JVM would have to be modified, or quite large pieces of code would have to be compiled natively into a DLL (at runtime), which means you’d need to ship a C/C++ compiler.

I never talked about many JNI calls

The idea is: if a method uses only certain instructions and APIs (working with vector4s and 4x4 matrices), it can be translated into an intermediate, pre-compiled language like GLSL, and a JNI lib (e.g. JOGL) can then execute that code instead of the Java version. Of course you could also build your own simple vector language, with a compiler that uses SIMD and a JNI lib to compile and call the script.

=> Fully cross platform, scales with time.
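
A hedged sketch of how such a marker could look - everything here (the @Vectorizable annotation, the class and method names) is hypothetical, in the spirit of Mono’s attribute-driven approach mentioned earlier; nothing like it exists in the JDK:

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

// Hypothetical marker: "this method only uses simple float[4] vector math".
@Retention(RetentionPolicy.CLASS)
@Target(ElementType.METHOD)
@interface Vectorizable {}

class Particles {
    // A bytecode transformer could translate this method to GLSL or to an
    // SSE-backed native routine and redirect calls to it, falling back to
    // this plain Java version when no such back end is available.
    @Vectorizable
    static void madd(float[] dest, float[] a, float[] b, float s) {
        for (int i = 0; i < 4; i++) {
            dest[i] = a[i] + b[i] * s;
        }
    }
}
```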

Using instrumentation (and ASM) you can easily extract the method’s code, transform it and replace it…

Instrumentation can be applied each time a VM starts, or as a compile step.

In the first case, a C++/GLSL/… compiler is needed on the user’s PC, but one can do additional optimizations because one knows the exact CPU (number of cores, instruction set, …).
In the latter case, you simply publish multiple versions: e.g. standard, intel-performance-pack, opengl-pack, …
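
For the first case, a hedged sketch of what such an agent could look like, assuming ASM is on the classpath - the class names SimdAgent and VectorMethodRewriter and the mygame/ package filter are made up for the example; only the java.lang.instrument and ASM calls are real APIs:

```java
import java.lang.instrument.ClassFileTransformer;
import java.lang.instrument.Instrumentation;
import java.security.ProtectionDomain;

import org.objectweb.asm.ClassReader;
import org.objectweb.asm.ClassWriter;

public class SimdAgent {

    public static void premain(String agentArgs, Instrumentation inst) {
        inst.addTransformer(new ClassFileTransformer() {
            public byte[] transform(ClassLoader loader, String className,
                                    Class<?> classBeingRedefined,
                                    ProtectionDomain protectionDomain,
                                    byte[] classfileBuffer) {
                // Only touch the game's own classes.
                if (className == null || !className.startsWith("mygame/")) {
                    return null; // null means "leave the class unchanged"
                }
                ClassReader reader = new ClassReader(classfileBuffer);
                ClassWriter writer = new ClassWriter(ClassWriter.COMPUTE_MAXS);
                // A ClassVisitor would go between reader and writer here,
                // e.g. one that replaces the body of @Vectorizable methods
                // with a call into a JNI/GLSL-backed implementation:
                //   reader.accept(new VectorMethodRewriter(writer), 0);
                reader.accept(writer, 0); // currently a pass-through copy
                return writer.toByteArray();
            }
        });
    }
}
```

The agent would be packaged in a jar with a Premain-Class entry in its manifest and started with -javaagent; the compile-step variant would instead run the same ASM transformation offline over the class files before packaging.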