A while back some people here attempted to implement a Java SIMD library. Not sure how far along they got, but I came across a pretty cool-looking library called Yeppp!, which has a Java binding and claims to be a really fast vector maths library with SIMD support; more details here.
Could do with benchmarking vs JOML…
Cas
Well, that about wraps it up for Yeppp!, eh?
Cas
Yep! =PPPP
In case anyone missed it:
I haven’t had time to investigate to what level.
Wow, I did miss it
It’s very interesting! Finally some loop auto-vectorization.
Do you know whether the JVM will use packed vector instructions for operations on “adjacent” float fields, such as when doing Vector4f.add(Vector4f) in JOML?
It would be very, very cool if HotSpot packed every four adjacent float fields into the four channels of an XMM register. But I guess it will never do this, since it would then have to do all sorts of shuffling of single values down to the first channel and fall back to scalar operations whenever those fields are not accessed in that pattern of four identical operations.
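To make the pattern concrete, here is a minimal field-based sketch (hypothetical, in the spirit of JOML's Vector4f, not its actual code):

/**
 * Hypothetical field-based vector. The question above is whether HotSpot would ever
 * load the four adjacent fields x, y, z, w as one packed XMM value; in practice
 * (at least with current HotSpot) this compiles to four scalar addss instructions,
 * one per field.
 */
class FieldVector4f {
    float x, y, z, w;

    FieldVector4f add(FieldVector4f v) {
        this.x += v.x;
        this.y += v.y;
        this.z += v.z;
        this.w += v.w;
        return this;
    }
}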
My guess is that it’ll currently only happen for very simple patterns, but that’s a guess. For fun I just tested these 2 methods:
/** does not generate addps */
private static void sump4s(float[] d, float[] a, float[] b)
{
    int len = d.length >> 2;
    int i = 0;
    while (i < len) {
        d[i] = a[i] + b[i]; i++;
        d[i] = a[i] + b[i]; i++;
        d[i] = a[i] + b[i]; i++;
        d[i] = a[i] + b[i]; i++;
    }
}
/** does not generate addps */
private static void sump4sa(float[] d, float[] a, float[] b)
{
    int len = d.length >> 2;
    int i = 0;
    while (i < len) {
        d[i  ] = a[i  ] + b[i  ];
        d[i+1] = a[i+1] + b[i+1];
        d[i+2] = a[i+2] + b[i+2];
        d[i+3] = a[i+3] + b[i+3];
        i += 4;
    }
}
No ‘addps’ means the computations are all scalar.
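For anyone who wants to reproduce the check, here is a sketch of a standalone driver (class name, SIZE and iteration count are made up, and the loop bound simply walks the whole array in steps of 4, a slight simplification of the code above). The -XX flags are real HotSpot diagnostic options, but printing assembly also needs the hsdis disassembler plugin on the JVM's library path.

// Run with: java -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly VectorizeCheck
// then look for the compiled sump4sa in the output and check for addps (packed) vs addss (scalar).
public class VectorizeCheck {
    static final int SIZE = 1 << 10; // must be a multiple of 4 for this sketch

    private static void sump4sa(float[] d, float[] a, float[] b) {
        for (int i = 0; i < d.length; i += 4) {
            d[i    ] = a[i    ] + b[i    ];
            d[i + 1] = a[i + 1] + b[i + 1];
            d[i + 2] = a[i + 2] + b[i + 2];
            d[i + 3] = a[i + 3] + b[i + 3];
        }
    }

    public static void main(String[] args) {
        float[] d = new float[SIZE], a = new float[SIZE], b = new float[SIZE];
        for (int i = 0; i < SIZE; i++) { a[i] = i; b[i] = 2 * i; }
        // Call the method many times so the JIT actually compiles it before we look at the asm.
        for (int iter = 0; iter < 100_000; iter++) {
            sump4sa(d, a, b);
        }
        System.out.println(d[SIZE - 1]); // keep the result alive
    }
}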
I couldn’t reproduce that (the all-scalar result) on 8u60 x64, Windows 10, Sandy Bridge. Could you share details about how you tested, and the full code?
Code is here: https://github.com/roquendm/JGO-Grabbag/blob/master/src/roquen/info/HsIntrinsics.java
8u60 x64, Win 7, on a really old Core 2 netbook CPU. I’ll post the asm later… I’m assuming you’re not seeing any packed ops?
EDIT: from the top of log - Java HotSpot™ 64-Bit Server VM (25.60-b23) for windows-amd64 JRE (1.8.0_60-b27), built on Aug 4 2015 11:06:27 by "java_re" with MS VC++ 10.0 (VS2010)
Seems to be sensitive to the array size. Using your code, but with 10000 iterations of the outer loop (vs 10 originally):
- SIZE = 1<<8: always scalar
- SIZE = 1<<9: sometimes scalar, sometimes packed
- SIZE = 1<<10: always packed
Anyway, confirmed, it works.
Also check which compiler is generating the asm…
EDIT: probably hitting the compile threshold between C1 & C2 is what I mean.
EDIT 2: if edit 1 is the correct assumption… then it’s not the array size that matters; the code just hasn’t run enough times.
Ah, right, I always forget that tiered compilation is enabled by default now. The scalar code was indeed compiled by C1; with -XX:-TieredCompilation I always get packed instructions.
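If anyone wants to see which compiler produced the code, -XX:+PrintCompilation is another standard HotSpot flag; with tiered compilation on, each compiled method is printed with its tier (1-3 are C1, 4 is C2). A sketch, reusing the hypothetical VectorizeCheck driver from above:

// Show what gets compiled, and by which tier, while the driver runs:
//   java -XX:+PrintCompilation VectorizeCheck
// The number printed just before the method name is the tier: 1-3 means C1, 4 means C2.
// Disabling tiered compilation goes straight to C2, which matches the observation above:
//   java -XX:-TieredCompilation -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly VectorizeCheck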
Darn… that means rewriting JOML to store matrix and vector components in float[] arrays instead of primitive fields, and performing all operations with fixed-length loops, just to get the benefit of packed ops.
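A rough sketch of what that restructuring could look like (hypothetical, not actual JOML code); whether HotSpot really vectorizes such a short fixed trip-count loop is a separate question:

/** Hypothetical array-backed variant of the field-based vector sketched earlier. */
class ArrayVector4f {
    final float[] c = new float[4]; // x, y, z, w stored contiguously

    ArrayVector4f add(ArrayVector4f v) {
        // Counted loop over arrays: the shape the JIT's auto-vectorizer looks for,
        // unlike four independent adds on separate fields.
        for (int i = 0; i < 4; i++) {
            c[i] += v.c[i];
        }
        return this;
    }
}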