I really can’t get it to work, but well, I guess I can wait until 1.7 is out
After a little searching on the web to figure out how to look at the assembly that HotSpot generates, I found a blog post that used the flag -XX:+PrintOptoAssembly, and that works quite nicely. If the printout from this flag is correct, the code generated by HotSpot is doing all the math with scalar SSE instructions (movss/mulss, one float at a time in the XMM registers), which would explain why we have not noticed any improvement in application speed despite the SuperWord (SIMD) optimizations that are in 1.7. The run that I dumped to the log actually went through the SuperWord optimizer twice, or at least SuperWord dumped to the log twice via -XX:+TraceSuperWord, but it never generated any packed SIMD instructions. It seems like a waste not to use SIMD at all. Even in a worst-case scenario with unaligned loads we would still see a little speed bump over the scalar code. Did someone at Sun forget to turn this bit of code all the way on? Please correct me if I am reading this wrong. I don’t know anything about HotSpot and I have not spent much time looking at assembly since college.
Here’s a bit of the JIT output. There is a longer version that is basically the same thing repeated, I assume from loop unrolling.
04e mulss XMM1, [RAX + #24 (8-bit)]
053 movss [RAX + #24 (8-bit)], XMM1 # float
058 movss XMM0, [R9 + #28 (8-bit)] # float
05e mulss XMM0, [RAX + #28 (8-bit)]
063 movss [RAX + #28 (8-bit)], XMM0 # float
068 movss XMM1, [R9 + #32 (8-bit)] # float
06e mulss XMM1, [RAX + #32 (8-bit)]
073 movss [RAX + #32 (8-bit)], XMM1 # float
078 movss XMM0, [RAX + #36 (8-bit)] # float
07d mulss XMM0, [R9 + #36 (8-bit)]
083 movss [RAX + #36 (8-bit)], XMM0 # float
088 incl R11 # int
08b cmpl R11, #1
08f jl,s B2 # loop end P=0.500000 C=1180672.000000
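For anyone who wants to try reproducing this, here is a minimal sketch of the kind of loop that could produce a movss/mulss/movss sequence like the one above. The original source is not shown in this thread, so the class name MulLoop, the method name multiplyInPlace, the array sizes, and the iteration count are all assumptions made for illustration.

```java
// Hypothetical reproduction; the real benchmark behind the listing is not shown in the thread.
class MulLoop {
    // Element-wise a[i] *= b[i] over two float arrays. This is assumed to
    // resemble the loop that HotSpot compiled into the mulss sequence.
    static void multiplyInPlace(float[] a, float[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            a[i] *= b[i];
        }
    }

    public static void main(String[] args) {
        float[] a = new float[1024];
        float[] b = new float[1024];
        for (int i = 0; i < a.length; i++) {
            a[i] = 1.0f;
            b[i] = 1.0f;
        }
        // Call the method many times so the JIT has a reason to compile it.
        for (int iter = 0; iter < 20000; iter++) {
            multiplyInPlace(a, b);
        }
        // Use the result so the work cannot be discarded as dead code.
        System.out.println(a[0]);
    }
}
```

Running it as "java -XX:+PrintOptoAssembly -XX:+TraceSuperWord MulLoop" should dump the compiled loop and the SuperWord trace. As far as I know these two flags are only accepted by debug builds of the VM, so a product build may refuse them.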
I dug a little further, and it looks like even though the optimizer recognizes that this code should be SIMD’d, it decides against it in the end. It rejects the arithmetic as unsupported. As far as I can tell it really is unsupported, but without directly debugging the C++ code I’m not sure why it thinks vector adds/muls/etc. are not supported. It also decides that the vector loads/stores are not worth the effort for some reason. At this point I’d love to see the test code that was used to verify that this optimization actually works. Is this optimization too conservative to be useful, or just buggy?
Why not ask the HotSpot developers directly? http://mail.openjdk.java.net/mailman/listinfo/hotspot-dev
Dmitri
I suppose that is the next course of action. From what I can tell, the compiler doesn’t know how to generate the ADDPS/MULPS/SUBPS/DIVPS opcodes. That would be the reason SuperWord rejects the vector arithmetic as unsupported. It looks like the compiler can use MOVAPS, but it appears to only support moving data from XMM register to XMM register, so that is probably the reason the vector moves are rejected as unproductive. I have probably just missed something obvious, but it looks like no one added the arithmetic SIMD opcodes to the compiler.
I’m guessing that a response might not arrive until after the holidays, so I’m going to try to figure out how to add the opcodes to HotSpot myself. Unless I’m totally wrong, and I certainly hope I am, the SuperWord optimization won’t actually do anything useful on an x86 CPU.
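To make the scalar versus packed distinction concrete, here is a rough sketch in plain Java of the grouping that a packed multiply such as MULPS would perform. It emits no SIMD instructions by itself; it only shows which four element operations one 128-bit packed instruction would cover. The names PackedSketch, mulScalar, and mulPacked4 are hypothetical and are not taken from HotSpot.

```java
// Illustration only: plain Java that mirrors the lane grouping of a packed multiply.
class PackedSketch {
    // Four 32-bit floats fit in one 128-bit XMM register.
    static final int LANES = 4;

    // Scalar form: one multiply per element, like the MULSS sequence in the listing.
    static void mulScalar(float[] a, float[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            a[i] *= b[i];
        }
    }

    // Grouped form: each pass of the first loop covers the four lanes that a
    // single packed load / MULPS / packed store would handle together.
    static void mulPacked4(float[] a, float[] b) {
        int n = Math.min(a.length, b.length);
        int i = 0;
        for (; i + LANES <= n; i += LANES) {
            a[i]     *= b[i];
            a[i + 1] *= b[i + 1];
            a[i + 2] *= b[i + 2];
            a[i + 3] *= b[i + 3];
        }
        // Leftover elements that do not fill a whole group are done one at a time.
        for (; i < n; i++) {
            a[i] *= b[i];
        }
    }
}
```

Both methods compute the same result. The difference a vectorizing compiler is after is that the body of the first loop in mulPacked4 could become three packed instructions instead of twelve scalar ones.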