Did you get a chance to rework this? Also the bytecode transformer part? This sounds exciting!
First time I did bytecode transformation… but I got it to work! It’s fairly basic at this point, but I will share it once I’ve cleaned it up. You can choose whether you want your objects backed by a float[] or a FloatBuffer (only floats are supported atm), and possibly raw pointers at some point.
Meanwhile, here is some example code:
private int backing_offset = 0;
private float[] backing_array = null;
private FloatBuffer backing_buffer = null;

// will be a const in bytecode
public static int sizeof;

// these fields will not even exist at runtime
@FieldOffset(0) public float x;
@FieldOffset(1) public float y;
@FieldOffset(2) public float z;

public void index(int index) // will be generated later on
{
    this.backing_offset = sizeof * index;
}
public void test()
{
    this.index(1);
    System.out.println(Arrays.toString(this.backing_array));

    this.x = 13.13f;
    System.out.println(Arrays.toString(this.backing_array));

    this.y = 14.14f;
    System.out.println(Arrays.toString(this.backing_array));

    this.z = this.x * this.y;
    System.out.println(Arrays.toString(this.backing_array));
}
public VectorStruct duplicate() // will probably be generated too
{
    VectorStruct copy = new VectorStruct();
    copy.backing_offset = this.backing_offset;
    copy.backing_array = this.backing_array;
    copy.backing_buffer = this.backing_buffer;
    return copy;
}
It outputs (writing into the 2nd struct):
[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
[0.0, 0.0, 0.0, 13.13, 0.0, 0.0, 0.0, 0.0, 0.0]
[0.0, 0.0, 0.0, 13.13, 14.14, 0.0, 0.0, 0.0, 0.0]
[0.0, 0.0, 0.0, 13.13, 14.14, 185.6582, 0.0, 0.0, 0.0]
If we consider multi-threaded access, for example when doing a computation using JDK7’s fork/join framework, I guess we’ll have to use duplicate() in a ThreadLocal’s initialValue(), is that correct? So that different threads can work on a different backing_offset.
Indeed. Just like direct ByteBuffer instances with relative access, mapped objects are not thread-safe.
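To illustrate that pattern, here’s a minimal sketch (the VectorStruct stand-in mirrors the example above, but the ThreadLocal wiring and thread pool are mine, not part of the library): each thread gets its own duplicate with a private backing_offset, while the backing array itself stays shared.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ThreadLocalViews
{
    // plain stand-in for the mapped VectorStruct above (hypothetical;
    // the real fields would be rewritten away by the bytecode transformer)
    static class VectorStruct
    {
        int backing_offset = 0;
        float[] backing_array = null;

        VectorStruct duplicate()
        {
            VectorStruct copy = new VectorStruct();
            copy.backing_offset = this.backing_offset;
            copy.backing_array = this.backing_array;
            return copy;
        }
    }

    public static void main(String[] args) throws Exception
    {
        final VectorStruct shared = new VectorStruct();
        shared.backing_array = new float[9];

        // each thread gets its own view: same backing array, private offset
        final ThreadLocal<VectorStruct> view = new ThreadLocal<VectorStruct>()
        {
            @Override
            protected VectorStruct initialValue()
            {
                return shared.duplicate();
            }
        };

        ExecutorService pool = Executors.newFixedThreadPool(2);
        for (int i = 0; i < 2; i++)
        {
            final int index = i;
            pool.submit(new Runnable()
            {
                public void run()
                {
                    VectorStruct v = view.get();
                    v.backing_offset = 3 * index; // only visible to this thread
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(5, TimeUnit.SECONDS);

        // the shared instance was never touched, only the per-thread duplicates
        System.out.println("shared offset: " + shared.backing_offset);
    }
}
```

Each worker mutates only its own duplicate’s backing_offset, so the shared instance’s offset remains 0.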
I rewrote the implementation to use direct memory access. All primitives are supported now.
You’ll map a type using:
MappedVec2 vec2s = MappedVec2.map(byteBuffer);
As adding a Java agent to your application might be a tad unobvious for the average Joe, you have the option to install the bytecode transformer through code:
public static void main(String[] args)
{
    if (MappedInstanceTransformer.fork(MyApp.class, args))
    {
        MappedInstanceTransformer.register(MappedVec2.class);
        MappedInstanceTransformer.register(MappedVec3.class);
        MappedInstanceTransformer.register(MappedVec4.class);
        return;
    }

    // your code
    ByteBuffer bb = ByteBuffer.allocateDirect(MappedVec2.sizeof * n);
    MappedVec2 vec2s = MappedVec2.map(bb);
}
The ‘fork’ method grabs the URLs of the application ClassLoader, creates a new ClassLoader that transforms the classes, and calls the main method again, using the class from the new ClassLoader.
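Roughly, that mechanism could be sketched like this (ForkSketch and the mapped.forked system-property flag are made-up names, and the actual class transformation is omitted; this only shows the reload-and-reinvoke dance):

```java
import java.io.File;
import java.lang.reflect.Method;
import java.net.URL;
import java.net.URLClassLoader;

public class ForkSketch
{
    // hypothetical flag to detect the re-entrant call; system properties
    // are JVM-global, so the reloaded class sees it too
    private static final String FLAG = "mapped.forked";

    public static boolean fork(Class<?> mainClass, String[] args) throws Exception
    {
        if (Boolean.getBoolean(FLAG))
            return false; // already running inside the new ClassLoader

        System.setProperty(FLAG, "true");

        // grab the URLs of the application class path
        // (on older JDKs you could cast the app ClassLoader to URLClassLoader)
        String[] entries = System.getProperty("java.class.path").split(File.pathSeparator);
        URL[] urls = new URL[entries.length];
        for (int i = 0; i < entries.length; i++)
            urls[i] = new File(entries[i]).toURI().toURL();

        // a real implementation would override findClass here and run
        // the bytecode transformer on every class it defines
        ClassLoader transforming = new URLClassLoader(urls, null);

        Class<?> reloaded;
        try
        {
            reloaded = transforming.loadClass(mainClass.getName());
        }
        catch (ClassNotFoundException e)
        {
            return false; // class files not reachable from the class path
        }

        // call the main method again, using the class from the new loader
        Method main = reloaded.getMethod("main", String[].class);
        main.invoke(null, (Object) args);
        return true;
    }

    public static void main(String[] args) throws Exception
    {
        if (fork(ForkSketch.class, args))
            return;

        System.out.println("running in: "
            + ForkSketch.class.getClassLoader().getClass().getSimpleName());
    }
}
```

On the second (reloaded) pass, fork() sees the flag and returns false, so the application code below the if-block runs inside the new loader.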
Is the source/lib still available? The download links are broken, thanks.
I want to clean it up first (and run some proper benchmarks).
I’ll plug it into my sprite engine and see what sort of performance boost I can get.
Cas
It seems HotSpot doesn’t really like the bytecodes I feed it. At the moment it reaches 40% of the performance of plain field access. I believe I can do much better by duplicating the style of bytecode javac generates.
Is that on the client VM? I’ve had similar code using Unsafe failing miserably on the client VM, it couldn’t inline all the way to the intrinsified methods. Enabling tiered compilation helped but not much. On server everything was fine and faster than non-Unsafe code.
Well, I put a lot of bytecode at the call site, which makes the method body rather big and makes it harder for HotSpot to find the patterns. I’m now rewriting it to use method calls, so HotSpot can inline them, effectively ending up with the same bytecode…
I’m getting reasonably comparable performance now:
JDK 1.6.0u26 x86
instance took: 355us
mapped took: 460us
JDK 1.6.0u26 x64
instance took: 355us
mapped took: 455us
both running server VMs, which brings overhead to ‘only’ 28%
I’m a fan of the ‘release early’ so here goes:
Code quality is crap, but it works.
Feedback / bug reports are appreciated.
I needed to add -noverify to get it to work (I’m on 1.7.0_b146). Otherwise I get this:
Caused by: java.lang.VerifyError: Expecting a stack map frame in method eden.mapped.TestMappedObject.testWriteFieldAccess(Leden/mapped/MappedVec3;)V at offset 19
Results are similar to yours, 351us instanced vs 453us mapped.
One thing you can try is “baking” MappedObjectUnsafe into MappedObject and transforming the static methods to instance ones, like so:
public void fput(float value, long addr)
{
    INSTANCE.putFloat(baseAddress + addr, value);
}

public float fget(long addr)
{
    return INSTANCE.getFloat(baseAddress + addr);
}
That’s the only difference I can think of between my implementation and yours (note: I’m not doing any bytecode transformation). I don’t know if it will make any difference on performance, but it will make the forked bytecode shorter.
Interesting… I’ll look into that.
[quote=“Spasi,post:34,topic:31992”]
What’s the performance like in your implementation?
[/quote]
I’m doing multithreaded matrix-matrix multiplication on direct FloatBuffers. It’s 10% faster than the same code on float[].
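For reference, the kind of kernel being compared might look like this naive, single-threaded sketch using absolute FloatBuffer access (names are mine, not from the actual benchmark):

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.FloatBuffer;

public class BufferMatMul
{
    // C = A * B for n x n row-major matrices, using absolute get/put
    // so no position/limit bookkeeping is involved
    static void multiply(FloatBuffer a, FloatBuffer b, FloatBuffer c, int n)
    {
        for (int i = 0; i < n; i++)
        {
            for (int j = 0; j < n; j++)
            {
                float sum = 0.0f;
                for (int k = 0; k < n; k++)
                    sum += a.get(i * n + k) * b.get(k * n + j);
                c.put(i * n + j, sum);
            }
        }
    }

    static FloatBuffer direct(int floats)
    {
        return ByteBuffer.allocateDirect(floats * 4)
            .order(ByteOrder.nativeOrder()).asFloatBuffer();
    }

    public static void main(String[] args)
    {
        int n = 2;
        FloatBuffer a = direct(n * n), b = direct(n * n), c = direct(n * n);
        float[] av = {1, 2, 3, 4};
        for (int i = 0; i < 4; i++)
        {
            a.put(i, av[i]);
            b.put(i, (i == 0 || i == 3) ? 1.0f : 0.0f); // identity matrix
        }
        multiply(a, b, c, n);
        System.out.println(c.get(0) + " " + c.get(1) + " " + c.get(2) + " " + c.get(3));
        // prints: 1.0 2.0 3.0 4.0
    }
}
```

The float[] variant is identical except the buffers become arrays and `get`/`put` become plain indexing.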
Anyway, I just tried it and performance is the same.
(also baseAddress in my previous reply should have been viewAddress)
Interesting, although that’s not quite the same as object-field access. I’ll see how my mapped objects compare to float[] performance.
instance took: 357us
mapped took: 447us
backing array took: 454us
plain array took: 384us
plain unsafe took: 379us
I guess I hit the limit of what indirect access to arrays and buffers can achieve…
Having said that, any suggestions on the API?