Once again! fast MappedObjects implementation

concerto49 · June 10, 2011, 5:58am

Have you had a chance to rework this? And the bytecode transformer part? This sounds exciting!

Riven · June 23, 2011, 11:47pm

First time I did bytecode transformation… but I got it to work! It’s fairly basic at this point, but I will share it once I’ve cleaned it up. You can choose whether you want your objects backed by a float[] or a FloatBuffer (only floats are supported atm), and possibly by raw pointers at some point.

Meanwhile, here is some example code:


   private int         backing_offset = 0;
   private float[]     backing_array  = null;
   private FloatBuffer backing_buffer = null;

   // will be a const in bytecode
   public static int sizeof;

   // these fields will not even exist at runtime
   @FieldOffset(0)   public float        x;              
   @FieldOffset(1)   public float        y;   
   @FieldOffset(2)   public float        z;   

   public void index(int index) // will be generated later on
   {
      this.backing_offset = this.sizeof * index;
   }

   public void test()
   {
      this.index(1);

      System.out.println(Arrays.toString(this.backing_array));
      this.x = 13.13f;
      System.out.println(Arrays.toString(this.backing_array));
      this.y = 14.14f;
      System.out.println(Arrays.toString(this.backing_array));
      this.z = this.x * this.y;
      System.out.println(Arrays.toString(this.backing_array));
   }

   public VectorStruct duplicate() // will probably be generated too
   {
      VectorStruct copy = new VectorStruct();
      copy.backing_offset = this.backing_offset;
      copy.backing_array = this.backing_array;
      copy.backing_buffer = this.backing_buffer;
      return copy;
   }

It outputs (writing into the 2nd struct):


[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
[0.0, 0.0, 0.0, 13.13, 0.0, 0.0, 0.0, 0.0, 0.0]
[0.0, 0.0, 0.0, 13.13, 14.14, 0.0, 0.0, 0.0, 0.0]
[0.0, 0.0, 0.0, 13.13, 14.14, 185.6582, 0.0, 0.0, 0.0]
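For anyone wondering what the field accesses amount to after transformation, here is a hand-written sketch, not the actual generated bytecode. It assumes the float[] backing only, and the class name VectorStructSketch and the explicit getters/setters are made up for illustration. Each access to x, y or z becomes an array access at backing_offset plus the field’s @FieldOffset:

```java
import java.util.Arrays;

// Illustrative stand-in for the transformed struct (hypothetical names).
class VectorStructSketch {
    static final int SIZEOF = 3;            // three floats per struct
    private final float[] backingArray;     // shared storage for all structs
    private int backingOffset = 0;          // start of the currently selected struct

    VectorStructSketch(float[] backingArray) {
        this.backingArray = backingArray;
    }

    // Select which struct in the array this view points at.
    void index(int index) {
        backingOffset = SIZEOF * index;
    }

    // Field offsets 0, 1 and 2 mirror @FieldOffset(0..2) in the example above.
    float getX() { return backingArray[backingOffset + 0]; }
    float getY() { return backingArray[backingOffset + 1]; }
    float getZ() { return backingArray[backingOffset + 2]; }

    void setX(float v) { backingArray[backingOffset + 0] = v; }
    void setY(float v) { backingArray[backingOffset + 1] = v; }
    void setZ(float v) { backingArray[backingOffset + 2] = v; }

    // Same sequence of writes as test() above, on a 3-struct array.
    static float[] demo() {
        float[] data = new float[3 * SIZEOF];
        VectorStructSketch v = new VectorStructSketch(data);
        v.index(1);
        v.setX(13.13f);
        v.setY(14.14f);
        v.setZ(v.getX() * v.getY());
        return data;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(demo()));
    }
}
```

The writes land in elements 3 to 5 of the array, i.e. the 2nd struct, which matches the output above.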

Spasi · June 25, 2011, 11:38pm

If we consider multi-threaded access, for example when doing a computation using JDK7’s fork/join framework, I guess we’ll have to use duplicate() in a ThreadLocal’s initialValue(), is that correct? So that different threads can work on a different backing_offset.

Riven · June 26, 2011, 10:52am

Indeed. Just like direct ByteBuffer instances with relative access, mapped objects are not thread-safe.
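A rough sketch of the ThreadLocal approach, assuming a hand-written view class over a float[] (Vec3View and PerThreadViews are hypothetical names, not part of the library). Every thread gets its own duplicate, so each has a private offset while all of them share the same backing array:

```java
// Hypothetical view over a shared float[]; stands in for a mapped object.
class Vec3View {
    static final int SIZEOF = 3;
    private final float[] backing;
    private int offset;

    Vec3View(float[] backing) {
        this.backing = backing;
    }

    // Same backing storage, independent offset.
    Vec3View duplicate() {
        Vec3View copy = new Vec3View(backing);
        copy.offset = offset;
        return copy;
    }

    void index(int i) {
        offset = SIZEOF * i;
    }

    void set(float x, float y, float z) {
        backing[offset] = x;
        backing[offset + 1] = y;
        backing[offset + 2] = z;
    }
}

class PerThreadViews {
    // Fills 'count' structs using 'threads' workers, each with its own view.
    static float[] fillInParallel(int count, int threads) throws InterruptedException {
        final float[] data = new float[count * Vec3View.SIZEOF];
        final Vec3View master = new Vec3View(data);

        // Each thread lazily receives a private duplicate of the master view.
        final ThreadLocal<Vec3View> local = new ThreadLocal<Vec3View>() {
            @Override
            protected Vec3View initialValue() {
                return master.duplicate();
            }
        };

        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            final int first = t;
            workers[t] = new Thread(() -> {
                Vec3View view = local.get();
                // Strided split: no two threads ever write the same struct.
                for (int i = first; i < count; i += threads) {
                    view.index(i);
                    view.set(i, i + 0.5f, -i);
                }
            });
            workers[t].start();
        }
        for (Thread w : workers) {
            w.join();
        }
        return data;
    }
}
```

Note this only makes the offset thread-private. Writes to the same struct from two threads would still race, so the work has to be partitioned as in the loop above.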

Riven · June 26, 2011, 11:11am

I rewrote the implementation to use direct memory access. All primitives are supported now.

You’ll map a type using:

MappedVec2 vec2s = MappedVec2.map(ByteBuffer)
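To give an idea of what a mapped type behaves like, here is a hand-written approximation. It is only a sketch under assumptions: it uses absolute ByteBuffer.getFloat/putFloat rather than raw memory access, and MappedVec2Sketch with its getters/setters is a hypothetical name, since the real thing exposes plain fields that get rewritten by the transformer:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical hand-written counterpart of a mapped 2-float struct.
class MappedVec2Sketch {
    static final int SIZEOF = 8;         // two 4-byte floats
    private final ByteBuffer buffer;     // shares memory with the caller's buffer
    private int byteOffset;              // start of the currently selected struct

    private MappedVec2Sketch(ByteBuffer buffer) {
        this.buffer = buffer;
    }

    static MappedVec2Sketch map(ByteBuffer bb) {
        // duplicate() shares the content but keeps our byte order setting
        // from leaking into the caller's buffer.
        return new MappedVec2Sketch(bb.duplicate().order(ByteOrder.nativeOrder()));
    }

    // Select which struct in the buffer this view points at.
    void index(int index) {
        byteOffset = SIZEOF * index;
    }

    float getX() { return buffer.getFloat(byteOffset); }
    float getY() { return buffer.getFloat(byteOffset + 4); }

    void setX(float v) { buffer.putFloat(byteOffset, v); }
    void setY(float v) { buffer.putFloat(byteOffset + 4, v); }
}
```

Usage would look like this:

```java
ByteBuffer bb = ByteBuffer.allocateDirect(MappedVec2Sketch.SIZEOF * 4);
MappedVec2Sketch vec2s = MappedVec2Sketch.map(bb);
vec2s.index(2);      // third struct, bytes 16..23
vec2s.setX(1.5f);
vec2s.setY(-2.0f);
```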

As adding a Java agent to your application might be a tad unobvious for average Joe, you have the option to install the bytecode transformer through code.


public static void main(String[] args)
{
    if(MappedInstanceTransformer.fork(MyApp.class, args))
    {
         MappedInstanceTransformer.register(MappedVec2.class);
         MappedInstanceTransformer.register(MappedVec3.class);
         MappedInstanceTransformer.register(MappedVec4.class);
         return;
    }

    // your code

    ByteBuffer bb = ByteBuffer.allocateDirect(MappedVec2.sizeof * n);
    MappedVec2 vec2s = MappedVec2.map(bb);
}

The ‘fork’ method grabs the URLs of the application ClassLoader, creates a new ClassLoader that transforms the classes, and calls the main method again, using a class from the new ClassLoader.

concerto49 · June 26, 2011, 12:36pm

Is the source/lib still available? The download links are broken, thanks.

Riven · June 26, 2011, 1:04pm

I want to clean it up first (and run some proper benchmarks).

princec · June 26, 2011, 3:28pm

I’ll plug it into my sprite engine and see what sort of a performance boost I can get

Cas

Riven · June 26, 2011, 3:30pm

It seems HotSpot doesn’t really like the bytecodes I feed it.

At the moment it reaches 40% of the performance of field-access.

I believe I can do much better, by duplicating the style of bytecode javac generates.

Spasi · June 26, 2011, 3:55pm

Is that on the client VM? I’ve had similar code using Unsafe fail miserably on the client VM because it couldn’t inline all the way to the intrinsified methods. Enabling tiered compilation helped, but not much. On the server VM everything was fine and faster than non-Unsafe code.

Riven · June 26, 2011, 4:00pm

Well, I put a lot of bytecodes in the callsite which makes the method-body rather big, and makes it harder for HotSpot to find the patterns. I’m now rewriting it to method-calls, so HotSpot can inline it, effectively ending up with the same bytecode…

Riven · June 26, 2011, 4:18pm

I’m getting reasonably comparable performance now:

JDK 1.6.0u26 x86


instance took: 355us
mapped took: 460us

JDK 1.6.0u26 x64


instance took: 355us
mapped took: 455us

Both are running server VMs, which brings the overhead to ‘only’ 28%.

Riven · June 26, 2011, 4:58pm

I’m a fan of ‘release early’, so here goes:

Code quality is crap, but it works.
Feedback / bug reports are appreciated.

Spasi · June 26, 2011, 6:10pm

I need to add -noverify for it to work. I’m on 1.7.0_b146. Else I’m getting this:

Caused by: java.lang.VerifyError: Expecting a stack map frame in method eden.mapped.TestMappedObject.testWriteFieldAccess(Leden/mapped/MappedVec3;)V at offset 19

Results are similar to yours, 351us instanced vs 453us mapped.

One thing you can try is “baking” MappedObjectUnsafe into MappedObject and transforming the static methods to instance ones, like so:

   public void fput(float value, long addr)
   {
      INSTANCE.putFloat(baseAddress + addr, value);
   }

   public float fget(long addr)
   {
      return INSTANCE.getFloat(baseAddress + addr);
   }

That’s the only difference I can think of between my implementation and yours (note: I’m not doing any bytecode transformation). I don’t know if it will make any difference on performance, but it will make the forked bytecode shorter.

Riven · June 26, 2011, 6:33pm

Spasi:

I need to add -noverify for it to work. I’m on 1.7.0_b146. Else I’m getting this:
Caused by: java.lang.VerifyError: Expecting a stack map frame in method eden.mapped.TestMappedObject.testWriteFieldAccess(Leden/mapped/MappedVec3;)V at offset 19

Interesting… I’ll look into that.

What’s the performance like in your implementation?

Spasi · June 26, 2011, 7:00pm

I’m doing multithreaded matrix-matrix multiplication on direct FloatBuffers. It’s 10% faster than the same code on float[].

Spasi · June 26, 2011, 7:05pm

Anyway, I just tried it and performance is the same.

(also baseAddress in my previous reply should have been viewAddress)

Riven · June 26, 2011, 7:15pm

Interesting, although that’s not quite the same as object-field access. I’ll see how my mappedobjects compare to float[] performance.

Riven · June 26, 2011, 7:37pm


instance took:         357us

mapped took:           447us
backing array took:    454us

plain array took:      384us
plain unsafe took:     379us

I guess I hit the limit of what indirect access to arrays and buffers can achieve…

Riven · June 26, 2011, 8:43pm

Having said that, any suggestions on the API?