After rearranging the code a bit (moving the loops inside their methods, so it is easier to see what is happening) and simplifying the loops (HotSpot was doing everything in registers after the first few memory accesses, and it was again too much code to check), I came up with the following:
    private static void testB(int classes) {
        for (int i = 0; i < loop; i++) {
            ClazzB b = cB[i % classes];
            b.c(b.a() + b.b());
            b.z(b.x() + b.y());
        }
    }

    private static void testA(int classes) {
        for (int i = 0; i < loop; i++) {
            ClazzA a = cA[i % classes];
            a.c = a.a + a.b;
            a.z = a.x + a.y;
        }
    }
and the generated code for both methods (inner loop only, the rest is not important) is:
testA
080 B6: # B10 B7 <- B5 B8 Loop: B6-B8 inner stride: not constant Freq: 7.51066
080 MOV EAX,ESI
082 MOV ECX,[ESP + #4]
086 CDQ
IDIV ECX
0a0 CMPu EDX,EBX
0a2 Jge,us B10 P=0.000001 C=-1.000000
0a2
0a4 B7: # B11 B8 <- B6 Freq: 7.51065
0a4 MOV EBP,[EDI + #12 + EDX << #2]
0a8 MOV EDX,[EBP + #8]
0ab NullCheck EBP
0ab
0ab B8: # B6 B9 <- B7 Freq: 7.51065
0ab MOVSS XMM0a,[EBP + #24]
0b0 MOVSS XMM2a,[EBP + #20]
0b5 MOV ECX,[EBP + #12]
0b8 ADDSS XMM2a,XMM0a
0bc MOVSS [EBP + #28],XMM2a
0c1 ADD EDX,ECX
0c3 MOV [EBP + #16],EDX
0c6 INC ESI
0c7 CMP ESI,#1048576
0cd Jlt,s B6 P=1.000000 C=7.509333
0cd
and testB
1b0 B18: # B35 B19 <- B17 B32 Loop: B18-B32 inner stride: not constant Freq: 14.1767
1b0 MOV EAX,EBX
1b2 MOV ECX,[ESP + #28]
1b6 CDQ
IDIV ECX
1d0 CMPu EDX,[ESP + #36]
1d4 Jge,u B35 P=0.000001 C=-1.000000
1d4
1da B19: # B34 B20 <- B18 Freq: 14.1766
1da MOV EDI,[ESP + #32]
1de MOV ECX,[EDI + #12 + EDX << #2]
1e2 MOV [ESP + #8],ECX
1e6 MOV ECX,[ECX + #8]
1e9 NullCheck ECX
1e9
1e9 B20: # B64 B21 <- B19 Freq: 14.1766
1e9 TEST ECX,ECX
1eb Jlt B64 P=0.000000 C=6.667333
1eb
1f1 B21: # B62 B22 <- B20 Freq: 6.66733
1f1 CMP ECX,[ESP + #60]
1f5 Jge B62 P=0.000000 C=6.667333
1f5
1fb B22: # B59 B23 <- B21 Freq: 6.66733
1fb MOV ESI,ECX
1fd INC ESI
1fe MOV EDI,ECX
200 SHL EDI,#2
203 MOV EBP,[ESP + #4]
207 ADD EBP,EDI
209 MOV EDX,[EBP]
20c TEST ESI,ESI
20e Jlt B59 P=0.000000 C=6.667333
20e
214 B23: # B57 B24 <- B22 Freq: 6.66733
214 CMP ESI,[ESP + #60]
218 Jge B57 P=0.000000 C=6.667333
218
21e B24: # B54 B25 <- B23 Freq: 6.66733
21e MOV EAX,[EBP + #4]
221 ADD EAX,EDX
223 MOV EDX,ECX
225 ADD EDX,#2
228 TEST EDX,EDX
22a Jlt B54 P=0.000000 C=6.667333
22a
230 B25: # B52 B26 <- B24 Freq: 6.66733
230 CMP EDX,[ESP + #60]
234 Jge B52 P=0.000000 C=6.667333
234
23a B26: # B49 B27 <- B25 Freq: 6.66733
23a MOV [EBP + #8],EAX
23d MOV EAX,ECX
23f ADD EAX,#3
242 TEST EAX,EAX
244 Jlt B49 P=0.000000 C=6.667333
244
24a B27: # B47 B28 <- B26 Freq: 6.66733
24a CMP EAX,[ESP + #56]
24e Jge B47 P=0.000000 C=6.667333
24e
254 B28: # B44 B29 <- B27 Freq: 6.66733
254 MOV EBP,[ESP + #0]
257 ADD EBP,EDI
259 MOVSS XMM0a,[EBP + #12]
25e MOV EAX,ECX
260 ADD EAX,#4
263 TEST EAX,EAX
265 Jlt B44 P=0.000000 C=6.667333
265
26b B29: # B42 B30 <- B28 Freq: 6.66733
26b CMP EAX,[ESP + #56]
26f Jge B42 P=0.000000 C=6.667333
26f
275 B30: # B39 B31 <- B29 Freq: 6.66733
275 MOVSS XMM2a,[EBP + #16]
27a ADDSS XMM2a,XMM0a
27e ADD ECX,#5
281 TEST ECX,ECX
283 Jlt,s B39 P=0.000000 C=6.667333
283
285 B31: # B37 B32 <- B30 Freq: 6.66733
285 CMP ECX,[ESP + #56]
289 Jge,s B37 P=0.000000 C=6.667333
289
28b B32: # B18 B33 <- B31 Freq: 6.66733
28b MOVSS [EBP + #20],XMM2a
290 INC EBX
291 CMP EBX,#1048576
297 Jlt B18 # Loop end P=1.000000 C=7.509333
I’m quite surprised that the speed difference is only about a factor of 1.4 (on my AMD machine, with the short loop).
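
For anyone who wants to try the two loops themselves, here is a minimal self-contained sketch. The class definitions are not shown above, so everything around testA and testB is an assumption: ClazzA is guessed to hold six plain fields, and ClazzB is guessed to be a thin wrapper holding one index into shared int and float arrays, which is what the repeated bounds checks in the testB listing suggest. The names FieldVsArraySketch, LOOP, setup, ints, floats and idx are hypothetical, and the timing in main is only a rough indication, not a careful benchmark.

```java
public class FieldVsArraySketch {
    // 1048576, the same bound as in "CMP ESI,#1048576" in the listings.
    static final int LOOP = 1 << 20;

    // Assumed shape of ClazzA: plain int and float fields.
    static final class ClazzA {
        int a, b, c;
        float x, y, z;
    }

    // Assumed backing storage for ClazzB (hypothetical names).
    static int[] ints;
    static float[] floats;

    // Assumed shape of ClazzB: accessors that index into the shared arrays.
    static final class ClazzB {
        final int idx;
        ClazzB(int idx) { this.idx = idx; }
        int a() { return ints[idx]; }
        int b() { return ints[idx + 1]; }
        void c(int v) { ints[idx + 2] = v; }
        float x() { return floats[idx + 3]; }
        float y() { return floats[idx + 4]; }
        void z(float v) { floats[idx + 5] = v; }
    }

    static ClazzA[] cA;
    static ClazzB[] cB;

    // Fills both representations with the same values: a = i, b = 2i, x = i, y = 0.5.
    static void setup(int classes) {
        ints = new int[classes * 6];
        floats = new float[classes * 6];
        cA = new ClazzA[classes];
        cB = new ClazzB[classes];
        for (int i = 0; i < classes; i++) {
            ClazzA a = new ClazzA();
            a.a = i; a.b = 2 * i; a.x = i; a.y = 0.5f;
            cA[i] = a;
            int idx = i * 6;
            ints[idx] = i; ints[idx + 1] = 2 * i;
            floats[idx + 3] = i; floats[idx + 4] = 0.5f;
            cB[i] = new ClazzB(idx);
        }
    }

    static void testA(int classes) {
        for (int i = 0; i < LOOP; i++) {
            ClazzA a = cA[i % classes];
            a.c = a.a + a.b;
            a.z = a.x + a.y;
        }
    }

    static void testB(int classes) {
        for (int i = 0; i < LOOP; i++) {
            ClazzB b = cB[i % classes];
            b.c(b.a() + b.b());
            b.z(b.x() + b.y());
        }
    }

    public static void main(String[] args) {
        int classes = 16;
        setup(classes);
        // Repeat so that later rounds run compiled code rather than the interpreter.
        for (int round = 0; round < 20; round++) {
            long t0 = System.nanoTime();
            testA(classes);
            long t1 = System.nanoTime();
            testB(classes);
            long t2 = System.nanoTime();
            System.out.println("testA " + (t1 - t0) + " ns, testB " + (t2 - t1) + " ns");
        }
    }
}
```

Both loops compute the same sums, so after one pass the c and z values in cA should match the corresponding slots in ints and floats. Only the later rounds printed by main are worth comparing.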