A while back some people here attempted to implement a Java SIMD library. Not sure how far along they got, but I came across a pretty cool-looking library called Yeppp!, which has a Java binding and claims to be a really fast vector maths library with SIMD support; more details here.
Could do with benchmarking vs JOML…
Cas
Well, that about wraps it up for Yeppp!, eh?
Cas
Yep! =PPPP
In case anyone missed it:
I haven’t had time to investigate to what level.
Wow, I did miss it
It’s very interesting! Finally some loop auto-vectorization.
Do you know whether the JVM will use packed vector instructions for operations on “adjacent” float fields, such as when doing Vector4f.add(Vector4f) in JOML?
It would be very, very cool if HotSpot packed every four adjacent float fields into the four channels of an XMM register. But I guess it will never do this, since it would then have to do all sorts of shuffling of single values down to the first channel and fall back to scalar operations whenever those fields are not accessed in that pattern of four identical operations.
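To make the pattern concrete, here is a minimal field-based sketch (hypothetical, in the spirit of JOML's Vector4f, not its actual code):

/**
 * Hypothetical field-based vector. The question above is whether HotSpot would ever
 * load the four adjacent fields x, y, z, w as one packed XMM value; in practice
 * (at least with current HotSpot) this compiles to four scalar addss instructions,
 * one per field.
 */
class FieldVector4f {
    float x, y, z, w;

    FieldVector4f add(FieldVector4f v) {
        this.x += v.x;
        this.y += v.y;
        this.z += v.z;
        this.w += v.w;
        return this;
    }
}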
My guess is that it’ll currently only happen for very simple patterns, but that’s a guess. For fun I just tested these 2 methods:
/** does not generate addps */
private static void sump4s(float[] d, float[] a, float[] b)
{
    int len = d.length >> 2;
    int i = 0;
    while (i < len) {
        d[i] = a[i] + b[i]; i++;
        d[i] = a[i] + b[i]; i++;
        d[i] = a[i] + b[i]; i++;
        d[i] = a[i] + b[i]; i++;
    }
}
/** does not generate addps */
private static void sump4sa(float[] d, float[] a, float[] b)
{
    int len = d.length >> 2;
    int i = 0;
    while (i < len) {
        d[i  ] = a[i  ] + b[i  ];
        d[i+1] = a[i+1] + b[i+1];
        d[i+2] = a[i+2] + b[i+2];
        d[i+3] = a[i+3] + b[i+3];
        i += 4;
    }
}
No ‘addps’ means the computations are all scalar.
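For anyone who wants to reproduce the check, here is a sketch of a standalone driver (class name, SIZE and iteration count are made up, and the loop bound simply walks the whole array in steps of 4, a slight simplification of the code above). The -XX flags are real HotSpot diagnostic options, but printing assembly also needs the hsdis disassembler plugin on the JVM's library path.

// Run with: java -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly VectorizeCheck
// then look for the compiled sump4sa in the output and check for addps (packed) vs addss (scalar).
public class VectorizeCheck {
    static final int SIZE = 1 << 10; // must be a multiple of 4 for this sketch

    private static void sump4sa(float[] d, float[] a, float[] b) {
        for (int i = 0; i < d.length; i += 4) {
            d[i    ] = a[i    ] + b[i    ];
            d[i + 1] = a[i + 1] + b[i + 1];
            d[i + 2] = a[i + 2] + b[i + 2];
            d[i + 3] = a[i + 3] + b[i + 3];
        }
    }

    public static void main(String[] args) {
        float[] d = new float[SIZE], a = new float[SIZE], b = new float[SIZE];
        for (int i = 0; i < SIZE; i++) { a[i] = i; b[i] = 2 * i; }
        // Call the method many times so the JIT actually compiles it before we look at the asm.
        for (int iter = 0; iter < 100_000; iter++) {
            sump4sa(d, a, b);
        }
        System.out.println(d[SIZE - 1]); // keep the result alive
    }
}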
I couldn’t reproduce that (the all-scalar result) on 8u60 x64, Windows 10, Sandy Bridge. Could you share details about how you tested, and the full code?
Code is here: https://github.com/roquendm/JGO-Grabbag/blob/master/src/roquen/info/HsIntrinsics.java
8u60 x64, Win 7, on a really old Core 2 netbook CPU. I’ll post the asm later… I’m assuming you’re not seeing any packed ops?
EDIT: from the top of log - Java HotSpot™ 64-Bit Server VM (25.60-b23) for windows-amd64 JRE (1.8.0_60-b27), built on Aug 4 2015 11:06:27 by "java_re" with MS VC++ 10.0 (VS2010)
Seems to be sensitive to the array size. Using your code, but with 10000 iterations of the outer loop (vs 10 originally):
- SIZE = 1<<8: always scalar
- SIZE = 1<<9: sometimes scalar, sometimes packed
- SIZE = 1<<10: always packed
Anyway, confirmed, it works.
Also check which compiler is generating the asm…
EDIT: probably hitting the compile threshold between C1 & C2 is what I mean.
EDIT 2: if edit 1 is the correct assumption… then it’s not the array size that matters; the code just hasn’t run enough times.
Ah, right, I always forget that tiered compilation is enabled by default now. The scalar code was indeed compiled by C1; with -XX:-TieredCompilation I always get packed instructions.
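If anyone wants to see which compiler produced the code, -XX:+PrintCompilation is another standard HotSpot flag; with tiered compilation on, each compiled method is printed with its tier (1-3 are C1, 4 is C2). A sketch, reusing the hypothetical VectorizeCheck driver from above:

// Show what gets compiled, and by which tier, while the driver runs:
//   java -XX:+PrintCompilation VectorizeCheck
// The number printed just before the method name is the tier: 1-3 means C1, 4 means C2.
// Disabling tiered compilation goes straight to C2, which matches the observation above:
//   java -XX:-TieredCompilation -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly VectorizeCheck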
Darn… that means rewriting JOML to store matrix and vector components in float[] arrays instead of primitive fields, and performing all operations with fixed-length loops, just to get the benefit of packed ops.
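A rough sketch of what that restructuring could look like (hypothetical, not actual JOML code); whether HotSpot really vectorizes such a short fixed trip-count loop is a separate question:

/** Hypothetical array-backed variant of the field-based vector sketched earlier. */
class ArrayVector4f {
    final float[] c = new float[4]; // x, y, z, w stored contiguously

    ArrayVector4f add(ArrayVector4f v) {
        // Counted loop over arrays: the shape the JIT's auto-vectorizer looks for,
        // unlike four independent adds on separate fields.
        for (int i = 0; i < 4; i++) {
            c[i] += v.c[i];
        }
        return this;
    }
}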