Concrete benefits of SSE and SSE2 instructions in Java

TheAnalogKid · August 30, 2005, 6:25pm

Hi all,

I’ve searched the forums several times for the concrete benefits of the SSE and SSE2 instruction sets. I know that FP computations get a performance boost from them, but I’m confused about how they could boost graphics performance, since FPUs already do graphics acceleration. Could someone clarify all this please?

Thanks!

Linuxhippy · August 30, 2005, 9:37pm

I do not really understand this. What do you mean by saying that FPUs do graphics acceleration? An FPU is a unit in the processor and has nothing to do with the graphics card at all; it makes no difference whether you have a GF7800 or a Tseng board in your computer. (BTW, does anybody remember the Tseng VGA boards?)

However, to come back to SSE:
SSE is a SIMD instruction set. SIMD means Single Instruction, Multiple Data, i.e. one instruction operates on more than one data element in a single step. SSE allows, for example, four single-precision multiplications in one instruction.
If HotSpot detects that it can optimize code to use SSE, the resulting code will need fewer instructions. That’s it.

However, this mostly matters for math-heavy/algorithmic code, which today’s games mostly are not.

lg Clemens

TheAnalogKid · August 30, 2005, 11:41pm

[quote]I do not really understand this. What do you mean by saying that FPUs do graphics acceleration?
[/quote]
Oops! I made a typo here: I meant GPU, of course. I already know that the FPU is not related to graphics acceleration at all.

[quote]SSE is a SIMD instruction set. SIMD means Single Instruction, Multiple Data, i.e. one instruction operates on more than one data element in a single step. SSE allows, for example, four single-precision multiplications in one instruction.
If HotSpot detects that it can optimize code to use SSE, the resulting code will need fewer instructions. That’s it.
[/quote]
I know what SIMD is, but I was wondering how a game could benefit from it. What are the concrete uses?

Jeff · August 31, 2005, 5:06am

Well… let’s see.

On a system that doesn’t do triangle transform on the GPU it helps there, but of course you’re right that that’s becoming less and less the case.

Physics, however, is currently a big sucker of computation power.

TheAnalogKid · August 31, 2005, 12:46pm

Thanks Jeff!

[quote]On a system that doesn’t do triangle transform on the GPU it helps there
[/quote]
But I guess it’s unlikely that such a system will have a CPU with at least SSE instructions. And MMX is not very useful, as it doesn’t allow a program to use floating-point primitives without losing the performance boost of SIMD. But what about 3DNow! on AMD CPUs? I don’t know. I think it was there before SSE.
Now I understand how it can boost performance in video games. And talking about game physics, do you know if we could eventually see physics hardware accelerators embedded in video cards?

Linuxhippy · August 31, 2005, 2:19pm

Not really, since it would not “fit” well into a GPU. Game physics is often coupled very closely to the game engine and so needs to communicate a lot with RAM/CPU, which is something GPUs are not good at, since there is a lot of communication overhead between them.

lg Clemens

TheAnalogKid · August 31, 2005, 4:13pm

Yes very good point!

Raghar · August 31, 2005, 6:07pm

if you’d do

for …
array[c] = array2[c] * array3[c];

It should speed up your application considerably, especially if the arrays are aligned on 16-byte boundaries.

So it’s also important in video compression, various simultaneous tasks, and for cycle expansion
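For reference, the loop above could be written out in Java roughly as follows. This is only a sketch: the class and method names are made up for illustration, and whether the JIT turns the loop into packed SSE instructions is up to the VM, not the source code. Java also gives no direct control over array alignment.

```java
import java.util.Arrays;

// Hypothetical helper class, not part of any library mentioned in this thread.
final class ElementwiseMultiply {

    // Returns a new array with out[i] = a[i] * b[i]; both inputs must have the same length.
    static float[] multiply(float[] a, float[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("length mismatch");
        }
        float[] out = new float[a.length];
        // Each iteration is independent of the others, which is the property
        // a packed (four floats at a time) multiply would rely on.
        for (int i = 0; i < out.length; i++) {
            out[i] = a[i] * b[i];
        }
        return out;
    }

    public static void main(String[] args) {
        float[] a = {1f, 2f, 3f, 4f, 5f};
        float[] b = {2f, 2f, 2f, 2f, 2f};
        System.out.println(Arrays.toString(multiply(a, b)));
    }
}
```

The loop body has no branches and no dependency between elements, which is what makes it a candidate for SIMD in the first place.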

darkprophet · August 31, 2005, 6:11pm

Not really… Take a look at ODE, Newton, Tokamak, Novodex (they have actually implemented hardware physics like you describe), TrueAxis, Havok, MathLib et al… They have all decoupled the game engine from the physics. In fact I’ll go as far as to say that it’s considered wrong design for a physics engine to be dependent on any game engine, because it doesn’t need to be.

Linuxhippy · August 31, 2005, 6:22pm

I did not know about that at all. However, I just wonder what benefit there would be in having a physics instruction set inside the GPU. Wouldn’t it, by design, fit much better into the CPU if someone really wants to implement it in HW?
Then again, I didn’t even know that physics consumes so many cycles these days, so I am anything but experienced in terms of game programming.

lg Clemens

darkprophet · August 31, 2005, 6:24pm

http://www.ageia.com/ - Enjoy a hardware-based physics engine that can also run on the CPU if the PPU isn’t found.

tom · August 31, 2005, 6:49pm

The VM might be using SSE2 instructions instead of the FPU, that is, if they can make it run faster on single data (not SIMD). SSE has an advantage since its registers are not stack-based. Maybe Jeff can ask the VM guys if SSE is used?

I find it extremely unlikely that the VM can produce SIMD code. If it does, it is only in very special cases. So you will not see any benefits of SIMD in Java. You could write a native library that takes advantage of SSE, but that is only beneficial if there is enough data to overcome the JNI overhead.

I’m sure SIMD instructions can be used whenever you need to do some serious number crunching. In games that might be:
-Sound (software mixers, softsynths, special fx)
-Physics
-AI
-Vertex manipulations, like in some shadow algorithms?
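To make the first item a bit more concrete, here is a rough sketch of the inner loop of a software mixer. The class name, method signature and the clamp to [-1, 1] are assumptions made for illustration, not taken from any real sound library. Whether a VM or a native library actually runs such a loop with SIMD instructions is a separate question.

```java
import java.util.Arrays;

// Hypothetical example class; names and sample format are illustrative assumptions.
final class SoftwareMixer {

    // Mixes two mono buffers into out, applying one gain per input,
    // and clamps every mixed sample to the range [-1, 1].
    static void mix(float[] a, float gainA, float[] b, float gainB, float[] out) {
        if (a.length != b.length || a.length != out.length) {
            throw new IllegalArgumentException("all buffers must have the same length");
        }
        for (int i = 0; i < out.length; i++) {
            float sample = a[i] * gainA + b[i] * gainB;
            out[i] = Math.max(-1.0f, Math.min(1.0f, sample));
        }
    }

    public static void main(String[] args) {
        float[] a = {0.5f, -0.5f, 1.0f};
        float[] b = {0.25f, 0.25f, 1.0f};
        float[] out = new float[3];
        mix(a, 1.0f, b, 1.0f, out);
        System.out.println(Arrays.toString(out));
    }
}
```

Every sample goes through the same multiply, add and clamp, independent of its neighbours, which is the pattern SIMD instruction sets are built for.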

TheAnalogKid · August 31, 2005, 7:27pm

SSE and SSE2 instructions have been used, when available, by the server VM since 1.4.2. See the SDK doc: http://java.sun.com/j2se/1.4.2/changes.html#vm

[quote]I’m sure SIMD instructions can be used whenever you need to do some serious number crunching. In games that might be:
-Sound (software mixers, softsynths, special fx)
[/quote]
I know a lot of things about sound but don’t you think that DSPs are better suited for this kind of task?

ajiva · September 7, 2005, 7:35pm

SSE and SSE2 instructions are used on single data (so no SIMD stuff here). The biggest advantages are as follows (at least for the VM):

-More registers (albeit single and double precision FPU registers)
-Don’t use the FPU stack (except for trig/transcendentals)

Those two alone are worth quite a bit on FPU-heavy programs. Easing register pressure, especially on register-starved Intel CPUs, is a big win.

TheAnalogKid · September 7, 2005, 8:11pm

Sorry, but according to Wikipedia SSE and SSE2 are actually SIMD:

SSE:

[quote]SSE (Streaming SIMD Extensions) is a SIMD (Single Instruction, Multiple Data) instruction set designed by Intel, and introduced in their Pentium III series processors as a reply to AMD’s 3DNow! debuted a year earlier.
[/quote]
SSE2:

[quote]SSE2 is one of the IA-32 SIMD instruction sets, designed by Intel. It extends the earlier version SSE instruction set, and is intended to fully supplant MMX.
[/quote]
So who is right?

tom · September 7, 2005, 9:28pm

Yes, SSE and SSE2 are of course SIMD instruction sets. I’ve not used them, so I don’t know the details of how you load the registers etc. But nothing prevents you from only using a single element of data. Even though the instruction is applied to multiple elements, you just ignore the results of the elements you don’t use.

ajiva · September 8, 2005, 4:18pm

Right, but the VM doesn’t use the SIMD portion of SSE/SSE2. You can use SSE/SSE2 registers and instructions on single data.

Raghar · September 8, 2005, 6:00pm

SIMD means single instruction, multiple data. If you filled the rest of the data with 0, you’d effectively compute single instruction, single data. However, Intel SSE2 doesn’t work this way.

SSE and SSE2 have two variants of most instructions. One operates on the full xmm register; the other operates only on the data element in the least significant position, like XXXO (O is the computed data element). Of course some instructions aren’t exactly computation-intensive, like XOR and AND, so they are always done on the full SSE2 register. SQRT and DIV are not as friendly to the CPU, so there are IIRC 3 types of such instructions: one exact, the second a fast approximation, and the third, in most situations, a nearly exact result.
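The difference between the two variants can be modelled in plain Java. This is only a model of the behaviour, assuming a register of four float lanes with lane 0 as the least significant element; it does not execute any SSE instruction, and the class and method names are invented for the example.

```java
import java.util.Arrays;

// Hypothetical model of a four-lane xmm register holding floats; lane 0 is the "O" in XXXO.
final class PackedVersusScalar {

    // Packed form (in the style of MULPS): every lane of dst is multiplied by the same lane of src.
    static float[] packedMultiply(float[] dst, float[] src) {
        float[] result = new float[4];
        for (int lane = 0; lane < 4; lane++) {
            result[lane] = dst[lane] * src[lane];
        }
        return result;
    }

    // Scalar form (in the style of MULSS): only lane 0 is computed,
    // the other three lanes keep the destination's old values.
    static float[] scalarMultiply(float[] dst, float[] src) {
        float[] result = Arrays.copyOf(dst, 4);
        result[0] = dst[0] * src[0];
        return result;
    }

    public static void main(String[] args) {
        float[] dst = {2f, 3f, 4f, 5f};
        float[] src = {10f, 10f, 10f, 10f};
        System.out.println(Arrays.toString(packedMultiply(dst, src)));
        System.out.println(Arrays.toString(scalarMultiply(dst, src)));
    }
}
```

The scalar form is what allows a VM to use xmm registers for ordinary float and double arithmetic without doing any SIMD at all.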

The biggest benefit of SSE2 instructions is freeing the MMX registers for integer-only work, namely boolean operations and scratchpad work, without needing to mess with CPU state via the EMMS instruction. If all FP work is done in SSE2 registers, then the MMX register state should never change, thus no accidental stalls, and a nice eight 64-bit registers on a 32-bit computer. It also reduces pollution of the L1 cache. Note, however, that latency might be higher when accessing xmm registers than when accessing standard registers. And of course there is the problem of aligned versus unaligned memory loads. (Aligned is twice as fast as unaligned; ideally they should have nearly the same latency.)

(I hope that the above short introduction to SSE2 is without too many errors. I haven’t verified it against the Intel manuals.)

I strongly recommend not taking names like SIMD or vector instructions too literally. They are often used just for marketing purposes. For example, SSE2 instructions might sometimes be referred to as vector instructions, yet I have never seen a command like “DOT” or “normalize” in Intel’s documentation, and nobody has a serious need for them. (Yes, I know this misnomer originated from math and from attempts to unnecessarily import math terms into other areas.)
Also note that Wikipedia isn’t exactly a better resource than Intel’s programs for explaining work on SSE registers, or than the Intel manuals for the P4-family SSE3 assembly instructions.

BTW Azeem Jiva,
Is the JVM able to reduce cache pollution? For example, reads/writes to volatile members should bypass the cache completely (on a multi-CPU computer). And what about prefetching?

Raghar · September 9, 2005, 4:50pm

Actually, xmm registers are 128 bits wide. It doesn’t matter whether they hold 2 x 64-bit FP values or 4 x 32-bit ints, so it’s somewhat misleading to call them FPU registers.
Look at an instruction like paddd xmm1, mem
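The “same 128 bits, different interpretation” point can be illustrated in Java with a 16-byte buffer standing in for an xmm register. This is a sketch under that assumption only; the class name and the paddd helper are invented here, and the helper merely models four independent 32-bit adds that wrap on overflow, which is also how Java int addition behaves.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.Arrays;

// Hypothetical model: 16 bytes of storage viewed either as ints or as doubles.
final class XmmViews {

    // Models a packed 32-bit integer add: each of the four lanes is added separately.
    static int[] paddd(int[] a, int[] b) {
        int[] result = new int[4];
        for (int lane = 0; lane < 4; lane++) {
            result[lane] = a[lane] + b[lane]; // wraps around on overflow
        }
        return result;
    }

    public static void main(String[] args) {
        ByteBuffer register = ByteBuffer.allocate(16).order(ByteOrder.LITTLE_ENDIAN);
        // The same 16 bytes hold four 32-bit lanes or two 64-bit lanes.
        System.out.println("int lanes: " + register.asIntBuffer().remaining());
        System.out.println("double lanes: " + register.asDoubleBuffer().remaining());

        int[] a = {1, 2, 3, Integer.MAX_VALUE};
        int[] b = {1, 1, 1, 1};
        System.out.println(Arrays.toString(paddd(a, b)));
    }
}
```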

ajiva · September 14, 2005, 3:01pm

Alright, so they aren’t FPU registers. I think of them as that, but you’re right, you can put anything you want in them.