SSE with GCC vs Java Server VM

I really can’t get it to work, but well, I guess I can wait until 1.7 is out

After a little searching on the web for how to look at the assembly that HotSpot generates, I found a blog post that used the flag -XX:+PrintOptoAssembly, and that works quite nicely. If the printout from this flag is correct, the code generated by HotSpot is using only scalar SSE instructions (movss/mulss, one float at a time) for all the math, which is why we have not noticed any improvement in application speed despite the SuperWord (SIMD) optimizations that are in 1.7. The run that I dumped to the log actually went through the SuperWord optimizer twice, or at least SuperWord dumped to the log twice via -XX:+TraceSuperWord, but it never generated any packed SIMD instructions. It seems like a waste not to use SIMD at all. Even in a worst-case scenario with unaligned loads we would still see a little speed bump over scalar code. Did someone at Sun forget to turn this bit of code all the way on? Please correct me if I am reading this wrong. I don’t know anything about HotSpot and I have not spent much time looking at assembly since college.

Here’s a bit of the JIT output. There is a longer listing that is basically the same, I assume because of loop unrolling.


04e   	mulss   XMM1, [RAX + #24 (8-bit)]
053   	movss   [RAX + #24 (8-bit)], XMM1	# float
058   	movss   XMM0, [R9 + #28 (8-bit)]	# float
05e   	mulss   XMM0, [RAX + #28 (8-bit)]
063   	movss   [RAX + #28 (8-bit)], XMM0	# float
068   	movss   XMM1, [R9 + #32 (8-bit)]	# float
06e   	mulss   XMM1, [RAX + #32 (8-bit)]
073   	movss   [RAX + #32 (8-bit)], XMM1	# float
078   	movss   XMM0, [RAX + #36 (8-bit)]	# float
07d   	mulss   XMM0, [R9 + #36 (8-bit)]
083   	movss   [RAX + #36 (8-bit)], XMM0	# float
088   	incl    R11	# int
08b   	cmpl    R11, #1
08f   	jl,s   B2	# loop end  P=0.500000 C=1180672.000000
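
For anyone who wants to reproduce this, here is a minimal sketch of the kind of loop that should be a SuperWord candidate. It is not the code behind the listing above; the class name, method name, array sizes and iteration count are all made up for illustration. As far as I know, -XX:+PrintOptoAssembly and -XX:+TraceSuperWord are only accepted by debug builds of the VM, so treat the command line in the comment as an assumption too.

```java
// Hypothetical micro-benchmark; class and method names are illustrative only.
// Possible invocation, assuming a debug build of the server VM:
//   java -server -XX:+PrintOptoAssembly -XX:+TraceSuperWord SuperWordCandidate
final class SuperWordCandidate {

    // Element-wise a[i] *= b[i]. Each iteration is independent and touches
    // adjacent floats, which is the pattern a superword (SIMD) pass looks for.
    static void multiplyInPlace(float[] a, float[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            a[i] *= b[i];
        }
    }

    public static void main(String[] args) {
        float[] a = new float[1024];
        float[] b = new float[1024];
        java.util.Arrays.fill(a, 1.0f);
        java.util.Arrays.fill(b, 1.0001f);

        // Call the method many times so the JIT treats the loop as hot
        // and compiles it with the optimizing (server) compiler.
        for (int iter = 0; iter < 20000; iter++) {
            multiplyInPlace(a, b);
        }

        // Print one element so the work cannot be discarded as dead code.
        System.out.println(a[0]);
    }
}
```

If the vectorizer were doing its job, I would expect the loop body to show packed instructions such as MULPS in place of the repeated movss/mulss/movss triples.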

I dug a little further, and it looks like even though the optimizer recognizes that this code should be SIMD’d, it decides against it in the end. It rejects the arithmetic as unsupported. As far as I can tell that is accurate, but without directly debugging the C++ code I’m not sure why vector adds/muls/etc. are not supported. It also decides that the vector loads/stores are not worth the effort for some reason. At this point I’d love to see the test code that was used to verify that this optimization actually works. Is this optimization too conservative to be useful, or just buggy?

Why not ask hotspot developers directly? http://mail.openjdk.java.net/mailman/listinfo/hotspot-dev

Dmitri

I suppose that is the next course of action. From what I can tell, the compiler doesn’t know how to generate the ADDPS/MULPS/SUBPS/DIVPS opcodes. That would be the reason SuperWord rejects the vector arithmetic as unsupported. It looks like the compiler can use MOVAPS, but it appears to only support moving data from one XMM register to another, so that is probably the reason the vector moves are rejected as unproductive. I probably just missed something obvious, but it looks like no one added the arithmetic SIMD opcodes to the compiler.

I’m guessing that a response might not arrive until after the holidays, so I’m going to try to figure out how to add the opcodes to HotSpot myself. Unless I’m totally wrong, and I certainly hope I am, the SuperWord optimization won’t actually do anything useful on an x86 CPU.