SIMD means single instruction, multiple data. If you filled the rest of the data with 0, you would effectively be computing single instruction, single data. However, Intel SSE2 doesn’t work this way.
SSE and SSE2 have two kinds of instructions for the majority of the work. One operates on the full xmm register (packed); the other operates only on the data element in the least significant position (scalar), like XXXO (O is the computed data element). Of course some instructions aren’t exactly computation intensive, like XOR and AND, so they are always done on the full SSE2 register. SQRT and DIV are not as friendly to the CPU, so there are, IIRC, 3 types of such instructions: one for an exact result, a second for a fast approximation, and a third for a result that is, in most situations, nearly exact.
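To make the packed versus scalar distinction concrete, here is a minimal C sketch using the compiler intrinsics from xmmintrin.h (ADDPS versus ADDSS, SQRTPS versus RSQRTPS). The helper names are made up for illustration. For the “nearly exact” kind, the sketch assumes the common trick of refining the RSQRTPS approximation with one Newton-Raphson step; that is an assumption about what is meant, not a dedicated instruction.

```c
#include <xmmintrin.h>  /* SSE intrinsics: _mm_add_ps, _mm_add_ss, _mm_rsqrt_ps, ... */

/* Packed add (ADDPS): all four float lanes are summed. */
void add_packed(const float a[4], const float b[4], float out[4])
{
    _mm_storeu_ps(out, _mm_add_ps(_mm_loadu_ps(a), _mm_loadu_ps(b)));
}

/* Scalar add (ADDSS): only the lowest lane is summed; the upper
   three lanes are copied unchanged from the first operand. */
void add_scalar(const float a[4], const float b[4], float out[4])
{
    _mm_storeu_ps(out, _mm_add_ss(_mm_loadu_ps(a), _mm_loadu_ps(b)));
}

/* Exact square root of every lane (SQRTPS). */
void sqrt_exact(const float x[4], float out[4])
{
    _mm_storeu_ps(out, _mm_sqrt_ps(_mm_loadu_ps(x)));
}

/* Fast approximate reciprocal square root (RSQRTPS), roughly 12 bits
   of precision. */
void rsqrt_fast(const float x[4], float out[4])
{
    _mm_storeu_ps(out, _mm_rsqrt_ps(_mm_loadu_ps(x)));
}

/* Approximation refined by one Newton-Raphson step (assumed technique):
   y' = y * (1.5 - 0.5 * x * y * y) */
void rsqrt_refined(const float x[4], float out[4])
{
    __m128 vx     = _mm_loadu_ps(x);
    __m128 y      = _mm_rsqrt_ps(vx);
    __m128 half_x = _mm_mul_ps(_mm_set1_ps(0.5f), vx);
    __m128 corr   = _mm_sub_ps(_mm_set1_ps(1.5f),
                               _mm_mul_ps(half_x, _mm_mul_ps(y, y)));
    _mm_storeu_ps(out, _mm_mul_ps(y, corr));
}
```

With a = {1, 2, 3, 4} and b = {10, 20, 30, 40}, add_packed changes all four elements while add_scalar changes only the first one, which is the XXXO pattern.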
The biggest benefit of SSE2 instructions is freeing the MMX registers for integer-only work, namely boolean operator work and scratchpad work, without needing to mess with the CPU state using the EMMS instruction. (The MMX registers are aliased onto the x87 FPU registers, which is why EMMS is needed when switching between the two.) If all floating-point work is done in SSE2 registers, then the MMX register state should never change, so there are no accidental stalls, and you get a nice set of 8 64-bit registers on a 32-bit computer. It also reduces pollution of the L1 cache. Note, however, that latency might be higher when accessing xmm registers than when accessing standard registers. And of course there is the problem of aligned versus unaligned memory loads. (Aligned is twice as fast as unaligned; ideally they should have nearly the same latency.)
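The aligned/unaligned split shows up directly in the intrinsics. A small C sketch, assuming a made-up helper sum4 that checks the pointer at run time and picks the aligned load (MOVAPS) only when the address is a multiple of 16 bytes:

```c
#include <emmintrin.h>  /* SSE2 intrinsics (also pulls in the SSE ones) */
#include <stdint.h>     /* uintptr_t */

/* Sum four consecutive floats starting at p.
   _mm_load_ps (MOVAPS) requires a 16-byte aligned address and faults
   otherwise; _mm_loadu_ps (MOVUPS) accepts any address. */
float sum4(const float *p)
{
    __m128 v = ((uintptr_t)p % 16 == 0) ? _mm_load_ps(p)
                                        : _mm_loadu_ps(p);
    float lanes[4];
    _mm_storeu_ps(lanes, v);
    return lanes[0] + lanes[1] + lanes[2] + lanes[3];
}
```

In real code you would rather arrange for the data to be 16-byte aligned up front than branch on every load; the check here is only to show both instructions side by side.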
(I hope the short introduction to SSE2 above is without too many errors. I didn’t verify it against the Intel manuals.)
I strongly recommend not taking names like SIMD or vector instructions too literally. They are often used just for marketing purposes. For example, SSE2 instructions are sometimes referred to as vector instructions, yet I have never seen a command like “DOT” or “normalize” in Intel’s documentation, and nobody has a serious need for them. (Yes, I know this misnomer originated from math and from attempts to unnecessarily import math terms into other areas.)
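For what it is worth, a dot product on SSE registers is put together from a multiply, a couple of shuffles and adds. A rough C sketch with intrinsics; dot4 is a hypothetical helper name, and this is just one of several possible shuffle sequences:

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* Dot product of two 4-float vectors, built from MULPS plus
   shuffles and adds rather than a single "DOT" instruction. */
float dot4(const float a[4], const float b[4])
{
    /* prod = p0 p1 p2 p3 (lane 0 first) */
    __m128 prod = _mm_mul_ps(_mm_loadu_ps(a), _mm_loadu_ps(b));
    /* swap neighbours inside each pair: p1 p0 p3 p2 */
    __m128 swap = _mm_shuffle_ps(prod, prod, _MM_SHUFFLE(2, 3, 0, 1));
    /* sums = p0+p1, p0+p1, p2+p3, p2+p3 */
    __m128 sums = _mm_add_ps(prod, swap);
    /* bring the upper pair down so lane 0 holds p2+p3 */
    __m128 high = _mm_movehl_ps(sums, sums);
    /* scalar add in lane 0: (p0+p1) + (p2+p3) */
    return _mm_cvtss_f32(_mm_add_ss(sums, high));
}
```

Calling dot4 on {1, 2, 3, 4} and {5, 6, 7, 8} gives 70.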
Also note that Wikipedia isn’t exactly a better resource than Intel’s programs for explaining work on SSE registers, or than the Intel manuals for the P4-family SSE3 assembly instructions.
BTW Azeem Jiva
Is the JVM able to reduce cache pollution? For example, r/w to volatile members should bypass the cache completely (on a multi-CPU computer). And what about prefetching?