Once again! fast MappedObjects implementation

Edit: Latest version:

Usage:

@MappedType(sizeof = 12)
public class MappedVec3 extends MappedObject
{
   public float x;

   public float y;

   public float z; // you can hardcode this field's byte-offset using: @MappedField(offset=8)

   public float length()
   {
      return (float) Math.sqrt(x*x + y*y + z*z); // natural field access
   }

   @Override
   public String toString()
   {
      return "[" + x + "," + y + "," + z + "]";
   }
}

public class TestMappedObject
{
   public static void main(String[] args)
   {
      MappedObjectTransformer.register(MappedVec3.class);

      // we 'fork' the current application. the forked version will have transformed classes
      if (MappedObjectClassLoader.fork(TestMappedObject.class, args))
      {
         // if this is the version of the classloader without transformed classes, stop right here
         return;
      }

      // business as usual

      ByteBuffer bb = ByteBuffer.allocateDirect(4096);
      bb.order(ByteOrder.nativeOrder());
      MappedVec3 vecs = MappedVec3.map(bb);

      // naturally we start at mapping 0
      vecs.x = 1.30f;
      vecs.y = 2.50f;
      vecs.z = 3.60f;

      vecs.view = 1; // go to mapping at index 1
      vecs.x = 0.13f;
      vecs.y = 0.25f;
      vecs.z = 0.36f;

      float x0 = bb.getFloat(0 << 2);
      float y0 = bb.getFloat(1 << 2);
      float z0 = bb.getFloat(2 << 2);

      float x1 = bb.getFloat(3 << 2);
      float y1 = bb.getFloat(4 << 2);
      float z1 = bb.getFloat(5 << 2);
   }
}

Everything below this line is outdated…


I decided to give implementing MappedObjects another try, after the previous attempt went down the drain last year.

Let’s say you have this ‘struct’ in C (yes, my C-skills are somewhere below sea level):


struct Sphere
{
   float x, y, z, r;

   bool intersects(Sphere that)
   {
      float xd = this->x - that.x;
      float yd = this->y - that.y;
      float zd = this->z - that.z;
      float d2 = (xd * xd) + (yd * yd) + (zd * zd);
      float r = this->r + that.r;
      return d2 <= r * r;
   }
};

Currently you’d have to fiddle with FloatBuffers or float[]s with offsets and strides, which is extremely boring and error-prone to code. I wrote an API that does this for you, with a layer over it so that it supports byte[]s, and both direct and heap ByteBuffers. Last but not least, it’s extremely quick! Working with raw float[]s instead gains you only 5-7% performance, which is probably a good tradeoff for 99.99% of us.

Unfortunately, we still need to tell Java which fields have which offset, so the code to construct the MappedObject is a bit verbose. But once we define the abstract class (or interface), we can simply talk to that interface, so we keep all those nice features in our IDEs.

In the Java implementation it becomes:


   @MappedType(stride = 16)
   public abstract class Sphere // or interface...
   {
      // sliding window
      public abstract void map(int element);

      // getters (with annotations)
      public abstract @MappedField(offset = 0,  type = float.class) float x();
      public abstract @MappedField(offset = 4,  type = float.class) float y();
      public abstract @MappedField(offset = 8,  type = float.class) float z();
      public abstract @MappedField(offset = 12, type = float.class) float r();

      // setters
      public abstract void x(float x);
      public abstract void y(float y);
      public abstract void z(float z);
      public abstract void r(float r);

      public boolean intersects(Sphere that)
      {
         float xd = this.x() - that.x();
         float yd = this.y() - that.y();
         float zd = this.z() - that.z();
         float d2 = (xd * xd) + (yd * yd) + (zd * zd);
         float r = this.r() + that.r();
         return d2 <= r * r;
      }
}

So much for the boiler-plate! Now it’s time to make a MappedObject over a byte[] or a ByteBuffer:


      MappedObjectProvider<Sphere> provider = new MappedObjectProvider<Sphere>(Sphere.class);

      int sphereCount = 128;
      ByteBuffer data1 = ByteBuffer.allocate(      provider.sizeof() * sphereCount);
      ByteBuffer data2 = ByteBuffer.allocateDirect(provider.sizeof() * sphereCount);
      byte[]     data3 = new byte[                 provider.sizeof() * sphereCount];
      
      Sphere sA = provider.newMappedObject(data1, 0, sphereCount); // datasource, base, elementCount
      Sphere sB = provider.newMappedObject(data2, 0, sphereCount);
      Sphere sC = provider.newMappedObject(data3, 0, sphereCount);

      sA.map(5);  // slide the MappedObject to element 5 (offset = base+5*stride)
      sB.map(13); // slide the MappedObject to element 13

      // sA.x = sB.x;
      sA.x(sB.x());

The sourcecode (and binaries) are here:
http://213.247.55.3/~balk1242/html_stuff/ (for browsable packages, with source in formatted HTML)
jawnae.mapped.PrimitiveConverter

Although I’m using sun.misc.Unsafe to write all primitives in the byte[] at ‘native speed’, it’s impossible to get a buffer overflow.
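The "impossible to get a buffer overflow" claim comes down to validating the offset before delegating to Unsafe. A minimal sketch of that idea (class and method names here are illustrative, not the actual API, and the reflective grab of sun.misc.Unsafe is an assumption about how the instance is obtained):

```java
import java.lang.reflect.Field;

import sun.misc.Unsafe;

public class CheckedFloatAccess
{
   private static final Unsafe UNSAFE;
   static
   {
      try
      {
         Field field = Unsafe.class.getDeclaredField("theUnsafe");
         field.setAccessible(true);
         UNSAFE = (Unsafe) field.get(null);
      }
      catch (Throwable t)
      {
         throw new Error(t);
      }
   }

   // write a float into a byte[] at a byte-offset, checking bounds first
   public static void putFloat(byte[] data, int offset, float value)
   {
      if (offset < 0 || offset + 4 > data.length)
         throw new IndexOutOfBoundsException("offset: " + offset);
      UNSAFE.putFloat(data, (long) Unsafe.ARRAY_BYTE_BASE_OFFSET + offset, value);
   }

   // read a float from a byte[] at a byte-offset, checking bounds first
   public static float getFloat(byte[] data, int offset)
   {
      if (offset < 0 || offset + 4 > data.length)
         throw new IndexOutOfBoundsException("offset: " + offset);
      return UNSAFE.getFloat(data, (long) Unsafe.ARRAY_BYTE_BASE_OFFSET + offset);
   }
}
```

The raw Unsafe access itself is unchecked, so the explicit range check is the only thing standing between you and corrupting the heap.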

There are automatic optimisations when the stride is a power of two, which makes it roughly 40% faster.
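Presumably the win comes from the generated code replacing the multiply in the element-to-offset calculation with a shift, which is only valid when the stride is a power of two. A hedged sketch of that address arithmetic (helper names are mine):

```java
public class StrideOffsets
{
   // generic stride: computing an element's byte-offset needs a multiply
   public static int offsetMul(int base, int element, int stride)
   {
      return base + element * stride;
   }

   // power-of-two stride: the multiply collapses into a shift
   public static int offsetShift(int base, int element, int shift)
   {
      return base + (element << shift);
   }

   // derive the shift amount, rejecting strides that are not powers of two
   public static int shiftFor(int stride)
   {
      if (stride <= 0 || Integer.bitCount(stride) != 1)
         throw new IllegalArgumentException("stride is not a power of two: " + stride);
      return Integer.numberOfTrailingZeros(stride);
   }
}
```

When the stride is a compile-time constant the JIT does this itself; the point here is that runtime-generated mapping code can specialise the same way when it detects a power-of-two stride.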

Give it a try and let me know what you think!

Performance is best with byte[], direct bytebuffers are about 10% slower,
and for heap-bytebuffers I simply use the backing byte[], so they run at full speed too.

Needless to say you can finally have fields that overlap :persecutioncomplex:

for whatever that’s worth

I guess I should have titled this topic: structs for java with runtime class generation

I guess the biggest issue is that most people here (me included) don’t have a clue what it’s supposed to be good for. So maybe you could outline some specific real-life use case and how much of a performance improvement you managed to get.

Yup was thinking the same thing.

Hm… I thought this was a pretty well covered topic on these forums! Guess it’s not familiar territory to everybody :slight_smile:

It’s about not having 10,000 instances of classes like Vector3f, Matrix16f, Normal3f and TexCoord2f spread all over the heap, where you can’t really get good cache-locality and certainly cannot push all that data to the GPU in one call.

So now you have this ‘sliding window’, over your data, where you can pick/map a certain element on which you get/set your values or do your calculations.

Slow way, OO style:


class Vector3f
{
   public float x, y, z;
}

class Normal3f
{
   public float x, y, z;
}


Vector3f[] vectors = new Vector3f[1024];
Normal3f[] normals = new Normal3f[1024];
// ...initialize...


for(int i=0; i<vectors.length; i++)
{
   vectors[i].add(vectors[1024-i-1]);
   vectors[i].mul(vectors[1024-i-1]);
   normals[i].normalize();
}

// pump to FloatBuffer to update some VBO
FloatBuffer vec_norm_buf = ...;
int p = 0;
for(int i=0; i<vectors.length; i++)
{
   Vector3f v3f = vectors[i];
   vec_norm_buf.put(p + 0, v3f.x);
   vec_norm_buf.put(p + 1, v3f.y);
   vec_norm_buf.put(p + 2, v3f.z);

   Normal3f n3f = normals[i];
   vec_norm_buf.put(p + 3, n3f.x);
   vec_norm_buf.put(p + 4, n3f.y);
   vec_norm_buf.put(p + 5, n3f.z);

   p += 6;
}

You can also work directly on the FloatBuffer, but that code is just a heck of a lot of boiler-plate, and a pain in the ass to modify (for example when you change your interleaved buffer from VectorNormal to VectorNormalTexcoord).


   @MappedType(stride = 24)
   public abstract class Vector3f
   {
      public abstract void map(int element);

      public abstract @MappedField(offset = 0,  type = float.class) float x();
      public abstract @MappedField(offset = 4,  type = float.class) float y();
      public abstract @MappedField(offset = 8,  type = float.class) float z();

      public abstract void x(float x);
      public abstract void y(float y);
      public abstract void z(float z);
  }

   @MappedType(stride = 24)
   public abstract class Normal3f
   {
      public abstract void map(int element);

      public abstract @MappedField(offset = 12,  type = float.class) float x();
      public abstract @MappedField(offset = 16,  type = float.class) float y();
      public abstract @MappedField(offset = 20,  type = float.class) float z();

      public abstract void x(float x);
      public abstract void y(float y);
      public abstract void z(float z);
  }


      MappedObjectProvider<Vector3f> vProvider = new MappedObjectProvider<Vector3f>(Vector3f.class);
      MappedObjectProvider<Normal3f> nProvider = new MappedObjectProvider<Normal3f>(Normal3f.class);
      byte[] data = new byte[(vProvider.sizeof() + nProvider.sizeof()) * 1024];

      Vector3f vMap1 = vProvider.newMappedObject(data, 0, 1024);
      Vector3f vMap2 = vProvider.newMappedObject(data, 0, 1024);
      Normal3f nMap = nProvider.newMappedObject(data, 0, 1024);

for(int i=0; i<1024; i++)
{
   vMap1.map(i);
   vMap2.map(1024-i-1);
   nMap.map(i);

   vMap1.add(vMap2);
   vMap1.mul(vMap2);
   nMap.normalize();
}

// no need to copy everything to a Buffer

Note the stride of 24 (6 floats), and offsets for the floats in Vector3f and Normal3f.

This is as near as you get to structs in Java.
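To make that layout explicit: with a stride of 24 bytes, element i’s vector floats live at byte offset i*24 and its normal floats at i*24 + 12. The same arithmetic, written out by hand against a ByteBuffer (helper names are mine, this is what the mapping API saves you from writing):

```java
import java.nio.ByteBuffer;

public class InterleavedVecNorm
{
   static final int STRIDE = 24;        // 6 floats per element: vector + normal
   static final int NORMAL_OFFSET = 12; // normals start after the 3 vector floats

   public static void putVector(ByteBuffer data, int element, float x, float y, float z)
   {
      int p = element * STRIDE;
      data.putFloat(p + 0, x);
      data.putFloat(p + 4, y);
      data.putFloat(p + 8, z);
   }

   public static void putNormal(ByteBuffer data, int element, float x, float y, float z)
   {
      int p = element * STRIDE + NORMAL_OFFSET;
      data.putFloat(p + 0, x);
      data.putFloat(p + 4, y);
      data.putFloat(p + 8, z);
   }
}
```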

Next step would be to add a bytecode transformer that allows you to call fields instead of getters/setters, (which are transformed into getters/setters again at runtime).

A simple Vector3f.add() would look like:


public void add(Vector3f that)
{
   this.x(this.x() + that.x());
   this.y(this.y() + that.y());
   this.z(this.z() + that.z());
}

With a bytecode transformer it would look like:


public void add(Vector3f that)
{
   this.x += that.x;
   this.y += that.y;
   this.z += that.z;
}

Anyway, there is no performance penalty, even when writing floats into byte[]s or ByteBuffers.

Hmm how did I miss this thread the last time.
Links at the top are broken - still got it all lying around somewhere?

Cas :slight_smile:

Not really. The data behind the links was lost when JGO got infected and I had to get a new VM instance. The actual sourcecode was deleted by me because I basically didn’t write it for myself, but for others here and it turned out nobody cared / understood. It shouldn’t take too long to write it again though.

Keep in mind it uses sun.misc.Unsafe (in a safe way) and Janino for the code generation. It’s basically a tiny layer over direct memory access: writing a float into a byte[] is done by an unsafe.putFloat(instance, offset, value), for example; with direct buffers I used memory addresses instead of offsets from objects. With short, char, int and long you get byte-order problems, but that’s pretty much guaranteed anyway when interfacing with raw memory.
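The byte-order problem is easy to demonstrate: Unsafe writes multi-byte values in the CPU’s native order, so a reader only gets the same value back if it agrees on that order. A small sketch (assuming sun.misc.Unsafe is reachable via the usual reflection trick; class and method names are mine):

```java
import java.lang.reflect.Field;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

import sun.misc.Unsafe;

public class NativeOrderDemo
{
   private static final Unsafe UNSAFE;
   static
   {
      try
      {
         Field field = Unsafe.class.getDeclaredField("theUnsafe");
         field.setAccessible(true);
         UNSAFE = (Unsafe) field.get(null);
      }
      catch (Throwable t)
      {
         throw new Error(t);
      }
   }

   // write an int with Unsafe (always native order), read it back with the given order
   public static int roundTrip(int value, ByteOrder readOrder)
   {
      byte[] data = new byte[4];
      UNSAFE.putInt(data, (long) Unsafe.ARRAY_BYTE_BASE_OFFSET, value);
      return ByteBuffer.wrap(data).order(readOrder).getInt(0);
   }
}
```

roundTrip(v, ByteOrder.nativeOrder()) always returns v; read with the opposite order and you get the byte-swapped value instead.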

I just had an enlightened moment (…) on how to enable the developer to write code using fields as opposed to those getters & setters. The bytecode transformations shouldn’t be too hard after that.

I watch with interest :slight_smile: Direct field access -> buffer is exactly what I’m looking for… whether that turns into an intrinsic move / load operation at the end of it is of course the holy grail and puts it on a par with pure C for performance.

Cas :slight_smile:

The overhead of Unsafe is significant. It’s about twice as slow as your regular Vector3f instance. (yay!)

At least, that’s what I recall from 2008.

Any real numbers on the difference between the “OO” way and the “struct” way?

Also when you say you can have overlapping offsets, do you mean C unions? (C unions are evil >:( ).

I had one trick when using the OO way. Every now and then re-allocate the entire set of objects in access order. This gave me a 2-4x speed increase on Intels and no increase or penalty on AMDs. But this was back in java 1.3 IIRC.

I just re-read that and was wondering how the abstraction and JNI affected performance. Sure you get the memory locality and fewer instances lying around, but it felt like you were adding a lot of extra work in reading/writing the “fields”.

The JVM got better since 2008 ;D

It’s still non-deterministic though, and you can have wildly varying performance between JVM runs… (it either takes 18us or 30us, after warmup, on my machine)

Here is a benchmark to give a rough indication of Unsafe performance:

package eden;

import java.lang.reflect.Field;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.FloatBuffer;
import java.util.Arrays;

import sun.misc.Unsafe;

public class MappedObject
{
   private static final long nanoTimeOverhead;
   static
   {
      nanoTimeOverhead = measureNanoTimeOverhead();
      System.out.println("Java version: " + System.getProperty("java.version"));
      System.out.println("JavaVM version: " + System.getProperty("java.vm.version"));
      System.out.println("nanoTimeOverhead: " + nanoTimeOverhead + "ns");
   }

   private static Unsafe grabUnsafeInstanceFromBuffer()
   {
      try
      {
         ByteBuffer dummy = ByteBuffer.allocateDirect(1);
         Field unsafeField = dummy.getClass().getDeclaredField("unsafe");
         unsafeField.setAccessible(true);
         return (Unsafe) unsafeField.get(dummy);
      }
      catch (Throwable t)
      {
         throw new Error(t);
      }
   }

   public static void main(String[] args)
   {
      Unsafe unsafe = grabUnsafeInstanceFromBuffer();

      //

      int vectors = 4 * 1024; // small, to keep it in cache
      int floats = vectors * 4;
      int bytes = floats * 4;

      //

      int runs = 256;

      for (int i = 0; i < 16; i++)
      {
         System.out.println();

         runFloatObjectBenchmark(vectors, floats, bytes, runs);
         runFloatBufferBenchmark(vectors, floats, bytes, runs);
         runFloatArrayBenchmark(vectors, floats, bytes, runs);
         runFloatPointerBenchmark(unsafe, vectors, floats, bytes, runs);
      }
   }

   static class FloatVector
   {
      public float x, y, z, w;
   }

   private static final String formatNanos(long nanos)
   {
      if (nanos > 10 * 1000000000L)
         return (nanos / 1000000000L) + "s";
      if (nanos > 10 * 1000000L)
         return (nanos / 1000000L) + "ms";
      if (nanos > 10 * 1000L)
         return (nanos / 1000L) + "us";
      return nanos + "ns";
   }

   private static final void printResults(String prefix, long[] durations)
   {
      System.out.println(prefix + ".best:   " + formatNanos(durations[(int) (durations.length * 0.00f)]));
      System.out.println(prefix + ".good:   " + formatNanos(durations[(int) (durations.length * 0.05f)]));
      System.out.println(prefix + ".median: " + formatNanos(durations[(int) (durations.length * 0.50f)]));
      System.out.println(prefix + ".bad:    " + formatNanos(durations[(int) (durations.length * 0.95f) - 1]));
      System.out.println(prefix + ".worst:  " + formatNanos(durations[(int) (durations.length * 1.00f) - 1]));
   }

   private static final void runFloatObjectBenchmark(int vectors, int floats, int bytes, int runs)
   {
      FloatVector[] a = new FloatVector[vectors];
      FloatVector[] b = new FloatVector[vectors];
      FloatVector[] dst = new FloatVector[vectors];

      // allocate the objects sequentially, per array
      for (int i = 0; i < vectors; i++)
         a[i] = new FloatVector();
      for (int i = 0; i < vectors; i++)
         b[i] = new FloatVector();
      for (int i = 0; i < vectors; i++)
         dst[i] = new FloatVector();

      long[] durations = new long[runs];
      for (int i = 0; i < durations.length; i++)
      {
         long t0 = System.nanoTime();
         testFloatObjectMulPerformance(a, b, dst, vectors);
         long t1 = System.nanoTime();
         long took = (t1 - t0) - 2 * nanoTimeOverhead;
         durations[i] = took;
      }

      Arrays.sort(durations);
      printResults("Vector", durations);
   }

   private static final void runFloatArrayBenchmark(int vectors, int floats, int bytes, int runs)
   {
      float[] a = new float[floats];
      float[] b = new float[floats];
      float[] dst = new float[floats];

      long[] durations = new long[runs];
      for (int i = 0; i < durations.length; i++)
      {
         long t0 = System.nanoTime();
         testFloatArrayMulPerformance(a, b, dst, vectors);
         long t1 = System.nanoTime();
         long took = (t1 - t0) - 2 * nanoTimeOverhead;
         durations[i] = took;
      }

      Arrays.sort(durations);
      printResults("float[]", durations);
   }

   private static final void runFloatBufferBenchmark(int vectors, int floats, int bytes, int runs)
   {
      FloatBuffer a = ByteBuffer.allocateDirect(bytes).order(ByteOrder.nativeOrder()).asFloatBuffer();
      FloatBuffer b = ByteBuffer.allocateDirect(bytes).order(ByteOrder.nativeOrder()).asFloatBuffer();
      FloatBuffer dst = ByteBuffer.allocateDirect(bytes).order(ByteOrder.nativeOrder()).asFloatBuffer();

      long[] durations = new long[runs];
      for (int i = 0; i < durations.length; i++)
      {
         long t0 = System.nanoTime();
         testFloatBufferMulPerformance(a, b, dst, vectors);
         long t1 = System.nanoTime();
         long took = (t1 - t0) - 2 * nanoTimeOverhead;
         durations[i] = took;
      }

      Arrays.sort(durations);
      printResults("FloatBuffer", durations);
   }

   private static final void runFloatPointerBenchmark(Unsafe unsafe, int vectors, int floats, int bytes, int runs)
   {
      long a = malloc_aligned(unsafe, bytes, 4 * 4);
      long b = malloc_aligned(unsafe, bytes, 4 * 4);
      long dst = malloc_aligned(unsafe, bytes, 4 * 4);

      long[] durations = new long[runs];
      for (int i = 0; i < durations.length; i++)
      {
         long t0 = System.nanoTime();
         testFloatPointerMulPerformance(unsafe, a, b, dst, vectors);
         long t1 = System.nanoTime();
         long took = (t1 - t0) - 2 * nanoTimeOverhead;
         durations[i] = took;
      }

      Arrays.sort(durations);
      printResults("float*", durations);
   }

   private static final long measureNanoTimeOverhead()
   {
      long nanoTimeOverhead = Long.MAX_VALUE;
      for (int k = 0; k < 64; k++)
      {
         for (int i = 0; i < 1024; i++)
            System.nanoTime();

         long runNanoTimeOverhead = Long.MAX_VALUE;
         for (int i = 0; i < 64; i++)
         {
            long currentOverhead = -(System.nanoTime() - System.nanoTime());
            if (currentOverhead < runNanoTimeOverhead)
               runNanoTimeOverhead = currentOverhead;
         }

         if (runNanoTimeOverhead < nanoTimeOverhead)
            nanoTimeOverhead = runNanoTimeOverhead;
      }
      return nanoTimeOverhead;
   }

   private static long malloc_aligned(Unsafe unsafe, long bytes, long align)
   {
      long base = unsafe.allocateMemory(bytes + align);
      long pntr = base + (align - (base % align));
      return pntr;
   }

   private static void testFloatObjectMulPerformance(FloatVector[] a, FloatVector[] b, FloatVector[] dst, int vectorCount)
   {
      for (int i = 0; i < vectorCount; i++)
      {
         int p = i;
         dst[p].x = a[p].x * b[p].x;
         dst[p].y = a[p].y * b[p].y;
         dst[p].z = a[p].z * b[p].z;
      }
   }

   private static void testFloatArrayMulPerformance(float[] a, float[] b, float[] dst, int vectorCount)
   {
      for (int i = 0; i < vectorCount; i++)
      {
         int p = i << 2;
         dst[p + 0] = a[p + 0] * b[p + 0];
         dst[p + 1] = a[p + 1] * b[p + 1];
         dst[p + 2] = a[p + 2] * b[p + 2];
      }
   }

   private static void testFloatBufferMulPerformance(FloatBuffer a, FloatBuffer b, FloatBuffer dst, int vectorCount)
   {
      for (int i = 0; i < vectorCount; i++)
      {
         int p = i << 2;
         dst.put(p + 0, a.get(p + 0) * b.get(p + 0));
         dst.put(p + 1, a.get(p + 1) * b.get(p + 1));
         dst.put(p + 2, a.get(p + 2) * b.get(p + 2));
      }
   }

   private static void testFloatPointerMulPerformance(Unsafe unsafe, long a, long b, long dst, int vectorCount)
   {
      for (int i = 0; i < vectorCount; i++)
      {
         int p = i << 4;

         unsafe.putFloat(dst + (p + 0), unsafe.getFloat(a + (p + 0)) * unsafe.getFloat(b + (p + 0))); // long + (int + int)
         unsafe.putFloat(dst + (p + 4), unsafe.getFloat(a + (p + 4)) * unsafe.getFloat(b + (p + 4)));
         unsafe.putFloat(dst + (p + 8), unsafe.getFloat(a + (p + 8)) * unsafe.getFloat(b + (p + 8)));
      }
   }

   private static void testFloatPointerMulPerformance_slower(Unsafe unsafe, long a, long b, long dst, int vectorCount)
   {
      for (int i = 0; i < vectorCount; i++)
      {
         int p = i << 4;
         long ap = p + a;
         long bp = p + b;
         long dstp = p + dst;

         unsafe.putFloat(dstp + 0, unsafe.getFloat(ap + 0) * unsafe.getFloat(bp + 0));
         unsafe.putFloat(dstp + 4, unsafe.getFloat(ap + 4) * unsafe.getFloat(bp + 4));
         unsafe.putFloat(dstp + 8, unsafe.getFloat(ap + 8) * unsafe.getFloat(bp + 8));
      }
   }
}

(please don’t whine about how flawed the benchmark is)

Output:


Java version: 1.6.0_20
JavaVM version: 16.3-b01
nanoTimeOverhead: 698ns

... [snip] ...

Vector.best:   26us
Vector.good:   26us
Vector.median: 27us
Vector.bad:    28us
Vector.worst:  52us

FloatBuffer.best:   20us
FloatBuffer.good:   20us
FloatBuffer.median: 20us
FloatBuffer.bad:    21us
FloatBuffer.worst:  112us

float[].best:   35us
float[].good:   35us
float[].median: 36us
float[].bad:    38us
float[].worst:  157us

float*.best:   17us
float*.good:   17us
float*.median: 18us
float*.bad:    19us
float*.worst:  95us

Now that we have a baseline, and it turns out Unsafe is (even more) feasible, I can try to get that MappedObject impl. up and running.

There are no JNI calls.

See above benchmark, on how Unsafe beats performance of sequentially allocated instances

Yes, I do. Maybe even worse, as the fields can partially overlap. This ‘support’ is a side effect and can easily be disabled with some checks/exceptions when initialising the sliding-window / mapped object. These checks are advisable anyway, because on non-x86 architectures misaligned memory access can crash a process.

After a single GC all your hard work is lost. Still it might be a good idea to do this once every few seconds.

I think a GC will still approximately keep the order, especially a generational GC and when the graph has the same access pattern (ie all are referenced from an array).

But Unsafe has native methods in it. I’ve always just assumed that the native method meant that it used JNI, or can the JVM have super-fast native functions too?

The JVM is a native process. It can read/write at any memory address (within its virtual address space). Writing into an object field is basically using the object as a pointer and the field as an offset.
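That “object as pointer, field as offset” model is exposed directly by Unsafe.objectFieldOffset. A sketch (assuming the usual reflective grab of sun.misc.Unsafe; the Point class and names are mine):

```java
import java.lang.reflect.Field;

import sun.misc.Unsafe;

public class FieldOffsetDemo
{
   static class Point
   {
      int x;
   }

   private static final Unsafe UNSAFE;
   private static final long X_OFFSET;
   static
   {
      try
      {
         Field field = Unsafe.class.getDeclaredField("theUnsafe");
         field.setAccessible(true);
         UNSAFE = (Unsafe) field.get(null);
         // byte-offset of 'x' from the start of any Point instance
         X_OFFSET = UNSAFE.objectFieldOffset(Point.class.getDeclaredField("x"));
      }
      catch (Throwable t)
      {
         throw new Error(t);
      }
   }

   // writes the field via "pointer + offset" instead of a normal field store
   public static int pokeX(int value)
   {
      Point p = new Point();
      UNSAFE.putInt(p, X_OFFSET, value);
      return p.x;
   }
}
```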

It’s actually pretty easy to figure out the pointer/reference of an object, as the JVM sees it:

  1. make an Object[1]
  2. write the object reference into index 0
  3. read the array index as an int (32 bit jvm) or long (64 bit jvm) using the Unsafe class

You now have the location of an Object on the heap, which was obviously already known by the JVM. If you’d write in that region, you’d be modifying fields.
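That trick can be sketched as follows. Note that on a 64-bit JVM with compressed oops the stored value is a compressed reference rather than a raw address, so the sketch uses the array index scale to decide how many bytes to read (class and method names are mine):

```java
import java.lang.reflect.Field;

import sun.misc.Unsafe;

public class ReferenceBits
{
   private static final Unsafe UNSAFE;
   static
   {
      try
      {
         Field field = Unsafe.class.getDeclaredField("theUnsafe");
         field.setAccessible(true);
         UNSAFE = (Unsafe) field.get(null);
      }
      catch (Throwable t)
      {
         throw new Error(t);
      }
   }

   // returns the reference bits of obj, exactly as stored in an Object[] slot
   public static long referenceBits(Object obj)
   {
      Object[] holder = new Object[] { obj };
      long base = Unsafe.ARRAY_OBJECT_BASE_OFFSET;
      int scale = Unsafe.ARRAY_OBJECT_INDEX_SCALE; // reference size in bytes
      if (scale == 4)
         return UNSAFE.getInt(holder, base) & 0xFFFFFFFFL; // 32-bit (or compressed) reference
      return UNSAFE.getLong(holder, base);                 // raw 64-bit reference
   }
}
```

A non-null object always yields non-zero bits, since zero is the encoding of null.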

The Unsafe class is basically a backdoor into the JVM: into what it already can do, and what you’re not supposed to be able to do in Java. All direct Buffers (FloatBuffer, IntBuffer, etc.) are backed by Unsafe. The reason that Buffers are still reasonably fast, despite the bunch of abstraction layers, is that the JVM turns all these calls to Unsafe methods into machine instructions.

This is why:

         unsafe.putFloat(dst + (p + 0), unsafe.getFloat(a + (p + 0)) * unsafe.getFloat(b + (p + 0)));
         unsafe.putFloat(dst + (p + 4), unsafe.getFloat(a + (p + 4)) * unsafe.getFloat(b + (p + 4)));
         unsafe.putFloat(dst + (p + 8), unsafe.getFloat(a + (p + 8)) * unsafe.getFloat(b + (p + 8)));

is about as fast as:

         dst[p + 0] = a[p + 0] * b[p + 0];
         dst[p + 1] = a[p + 1] * b[p + 1];
         dst[p + 2] = a[p + 2] * b[p + 2];

Actually, all those Unsafe method calls are slightly faster than pure float[] access, probably because there are no bounds checks.

hm, be interesting to actually look at the machine code it outputs. I’m thinking it should genuinely be as fast as C if it’s intrinsified by the JVM.

Cas :slight_smile:

The problem with Unsafe is that all pointers are longs. On 32-bit JVMs this is rather inefficient, because all memory address calculations have to be done as longs, which the JVM then has to truncate back to ints. This simply can’t even remotely be as fast as C code doing pointer bumps on ints, unless the JVM manages to get rid of the int->long->int conversion altogether.

Still, running the above code on a 32-bit JVM vs a 64-bit JVM doesn’t show any performance difference, which leads to the conclusion that there is another bottleneck: either the cache is the limit, or the JVM doesn’t remotely reach the maximum possible performance in this benchmark for any of the cases: float[], FloatBuffer and FloatVector.
Still, running the above code on a 32 bit JVM vs a 64 bit JVM doesn’t show any performance difference, leading to the conclusion that we have another bottleneck: either the cache is the bottleneck or the JVM doesn’t remotely reach the maximum possible performance, in this benchmark, for all cases: float[], FloatBuffer and FloatVector.