Once again! fast MappedObjects implementation

Is there a reason why Janino was used for code generation and not say ASM? Is it a better lib? Do you know how they compare?

I used the debug JVM to have a look at the generated assembly. The good news is that memory is directly accessed as expected and SSE registers are used to do the math in both cases. The difference between mapped and instanced is in the loop unrolling:

  • Mapped is unrolled only 2 times vs 4 for instanced.
  • There’s some heavy operation reordering in the instanced case, whereas for mapped it’s the same sequence twice, and afaict in the exact same order as the corresponding bytecode instructions.

So I guess Unsafe’s disadvantage is that it limits the extent of JVM optimizations that can be applied. That doesn’t mean mapped objects aren’t useful, of course. There are huge memory savings to be had and you can always offload expensive computation to OpenCL. :wink:

Riven, I’ll need to use this on some real code first to have any input on the API. It looks great so far though. Btw, have you given any thought to nested structs or anything more complex like that?

I manually unrolled the mapped object benchmark, and unfortunately saw no performance increase.

Well, the real savings (performance wise) are not having to copy everything to/from the buffers. It’s far from cheap, as you’re thrashing the CPU cache. Back in the day (2006, when I was doing rendering) the copying around of data was a fair chunk of the available time in the render-loop.
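
For comparison, the copy-based path that mapped objects avoid looks roughly like this (just a sketch; a plain Vector3f class and a direct FloatBuffer fb are assumed):

// hypothetical per-frame copy into the direct buffer before rendering
for (Vector3f v : vertices)
    fb.put(v.x).put(v.y).put(v.z);
fb.flip(); // plus the reverse copy whenever data needs to be read back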

It’s a shame that AALOAD and AASTORE don’t give you the type. It would have been great to access structs like java arrays:


vecs[0].x = vecs[1].y;

as opposed to messing with mapped.view(int)
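
With the current API the same assignment would look something like this (a sketch, assuming view(int) simply repositions the mapped instance):

vecs.view(1);
float y = vecs.y;
vecs.view(0);
vecs.x = y;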

As for interleaving mapped objects (not nesting) I’m planning to add support for stride (independent of sizeof) and offset (which is basically an increased baseAddress).

You could make a VTN VBO using:


int stride = (3+2+3)*4;
Vec3 vertices  = Vec3.map(bb, 0*4, stride);
Vec2 texcoords = Vec2.map(bb, 3*4, stride);
Vec3 normals   = Vec3.map(bb, 5*4, stride);

Maybe I’ll merge them somehow, so you can .view(index) them all in one go.

Indeed, I tried it as well: you get the extra unrolling (actually 2 unrolls at the Java level result in 4 unrolls at the assembly level, and 4 unrolls at the Java level in 8 at assembly), but again you don’t get the operation reordering.

I agree about the data copy overhead of course, but I recently compared a normal (OOP) quad-tree against a serialized quad-tree (with the nodes being “mapped objects”). Besides the win on tree traversal and bounds calculation (even more if you consider how easy it was to parallelize), the normal code took forever to start up and burned loads more memory, due to the instantiation of all those “Node” objects in the tree.

Added support for stride:


         MappedVec3 v = MappedVec3.map(bb);
         MappedVec2 t = MappedVec2.map(bb);
         MappedVec3 n = MappedVec3.map(bb);

         int stride = v.sizeof + t.sizeof + n.sizeof;
         if (stride != 32)
            throw new IllegalStateException();

         MappedVec3.configure(v, stride, 0);
         MappedVec2.configure(t, stride, v.sizeof);
         MappedVec3.configure(n, stride, v.sizeof + t.sizeof);
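
A rough usage sketch for filling one interleaved vertex (assuming view(i) repositions each mapping independently; px through nz are placeholder values):

v.view(i); t.view(i); n.view(i);
v.x = px; v.y = py; v.z = pz;
t.x = u;  t.y = w;
n.x = nx; n.y = ny; n.z = nz;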

Two bugfixes:

  1. Stride is now copied when calling MappedObject.dup(…)
  2. The ByteBuffer that is passed to MappedObject.map(…) is stored in a field, to prevent the GC from freeing the memory that is used by the mapped objects.

Think you can add a MappedObject.sizeof(Matrix4f.class), where Matrix4f is any subclass of MappedObject? The transformer would then replace the method call with the int value of sizeof that it already knows. You would then be able to do this:

ByteBuffer buffer = ByteBuffer.allocateDirect(nodes * MappedObject.sizeof(Matrix4f.class));
Matrix4f.map(buffer);

instead of

ByteBuffer buffer = ByteBuffer.allocateDirect(nodes * (16 * 4)); // hard-coded & error-prone byte size
Matrix4f.map(buffer);

What I do is basically a big hack, which has nothing to do with how Java works :slight_smile:

I can do this:


public class MappedObject
{
   public static final int SIZEOF = -1;
}

And rewrite the GETSTATIC to push an int at the call site:


int a = Matrix3f.SIZEOF; // after transform: int a = 48;
int b = Matrix4f.SIZEOF; // after transform: int b = 64;

The compiler will accept it, because SIZEOF is defined in the supertype MappedObject.

I use exactly the same trick to implement MappedObject.map(…) – I just look at the type of the INVOKESTATIC:


Matrix3f m3 = Matrix3f.map(bb); // after transform: Matrix3f m3 = new Matrix3f(); m3.init(bb, align, 48);
Matrix4f m4 = Matrix4f.map(bb); // after transform: Matrix4f m4 = new Matrix4f(); m4.init(bb, align, 64);

You might get a compile-time warning though, that you should read SIZEOF or call map(…) from the supertype. That’s a fair tradeoff.

To answer your question: will do, but as said it will be a static field, as opposed to a method call.

Changes:

  1. added <? extends MappedObject>.SIZEOF
  2. removed mappedObject.sizeof

Oh, I love this. Results on a ridiculously large quad-tree (2048x2048):

360.32ms [ Naive QuadTree ]
126.13ms [ Serial QuadTree ]
125.00ms [ Buffer QuadTree ]
124.88ms [ Buffer TLocal QuadTree ]
125.34ms [ Mapped QuadTree ]

Naive is the usual object-oriented implementation. Serial is the serialized version, using float arrays. Matrix multiplication then becomes:

public static void mul4f(final float[] a, final int pa, final float[] b, final int pb, final float[] t, final int p) {
	float m00 = a[pa + 0 * 4 + 0] * b[pb + 0 * 4 + 0] + a[pa + 1 * 4 + 0] * b[pb + 0 * 4 + 1] + a[pa + 2 * 4 + 0] * b[pb + 0 * 4 + 2] + a[pa + 3 * 4 + 0] * b[pb + 0 * 4 + 3];
	...
	t[p + 0 * 4 + 0] = m00;
	...

Buffer is my attempt at mapped objects + Unsafe (the FloatBuffer below is NOT java.nio.FloatBuffer, it’s my hacked version of it):

public static void mul4f(final FloatBuffer a, final int pa, final FloatBuffer b, final int pb, final FloatBuffer t, final int p) {
	float m00 = a.get(pa + 0 * 4 + 0) * b.get(pb + 0 * 4 + 0) + a.get(pa + 1 * 4 + 0) * b.get(pb + 0 * 4 + 1) + a.get(pa + 2 * 4 + 0) * b.get(pb + 0 * 4 + 2) + a.get(pa + 3 * 4 + 0) * b.get(pb + 0 * 4 + 3);
	...
	t.put(p + 0 * 4 + 0, m00);
	...

Buffer TLocal adds something like Riven’s view() on the FloatBuffer:

public static void mul4f(final FloatBuffer a, final int pa, final FloatBuffer b, final int pb, final FloatBuffer t, final int p) {
	a.setBaseOffset(pa);
	b.setBaseOffset(pb);

	float m00 = a.get(0 * 4 + 0) * b.get(0 * 4 + 0) + a.get(1 * 4 + 0) * b.get(0 * 4 + 1) + a.get(2 * 4 + 0) * b.get(0 * 4 + 2) + a.get(3 * 4 + 0) * b.get(0 * 4 + 3);
	...
	t.setBaseOffset(p);

	t.put(0 * 4 + 0, m00);
	...

Finally, Mapped uses Riven’s library-of-awesomeness:

public static Matrix4f mul4f(Matrix4f left, Matrix4f right, Matrix4f dest) {
	float m00 = left.m00 * right.m00 + left.m10 * right.m01 + left.m20 * right.m02 + left.m30 * right.m03;
	...
	dest.m00 = m00;
	...

Which is basically standard Java code (actually a copy-paste of LWJGL’s Matrix4f.mul code). Fun fact: the naive implementation peaks at 834 MB of memory used, whereas the Mapped implementation peaks at 346 MB.

Requests:

  • Add a static malloc(int count) method. It would be the equivalent of:
MappedClass.map(ByteBuffer.allocateDirect(count * MappedClass.SIZEOF));
  • Make dup() an instance method. So instead of:
private static final ThreadLocal<Matrix4f> localMatrices = new ThreadLocal<Matrix4f>() {
	protected Matrix4f initialValue() {
		return Matrix4f.dup(localMatrixData);
	}
};

we do this:

private static final ThreadLocal<Matrix4f> localMatrices = new ThreadLocal<Matrix4f>() {
	protected Matrix4f initialValue() {
		return localMatrixData.dup();
	}
};

It would then be possible to “generify” such code. I’ve changed your implementation to try this and it seems to work:

// MappedObject.java, line 38
public final <T extends MappedObject> T dup()

// MappedObjectTransformer.java, line 199
if ( opcode == INVOKEVIRTUAL && methodName.equals("dup") && className.equals(mappedType.className) && signature.equals("()L" + jvmClassName(MappedObject.class) + ";") ) {
  • Maybe add a method/field to retrieve the currently mapped index (the value you pass to the view method)?

  • It would be nice if you could add some extra code for constructors. Basically the current implementation requires an empty constructor (no params, no touching of fields or methods that use the fields, otherwise the JVM crashes). The library should either verify that that’s the case, or you could move the init method to the MappedObject constructor and make sure all subclasses call it accordingly (when .map is called). That would require quite a bit of bytecode transformation though, and I’m not sure how hard it is to do with ASM.

Nice!

Easy enough.

I deliberately made it a static method, and was thinking of making view(i) static too. Every instance method I define in MappedObject can’t be used by the end-user’s type, which might end up being quite restrictive…

I agree though that both view(…) and dup(…) should be either instance methods or static methods.

I might even… make it a field? :point:

vec3.view++

GETFIELD would look like: (viewAddress-baseAddress)/stride
PUTFIELD would look like: viewAddress=baseAddress+stack.popInt()*stride
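
Iterating over a mapped range would then read naturally (a sketch, assuming ‘view’ becomes the rewritten field and count elements have been mapped):

for (vec3.view = 0; vec3.view < count; vec3.view++)
    vec3.x *= scale;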

The problem with a non-default constructor in the supertype is that you must declare it in the subtype, or you get a compile-time error. I’d rather not have everybody define annoying constructors in their MappedObject subclasses, so that’s why I moved the code to the init() method. No matter how intelligent the transformer, I can’t solve these annoying language rules :slight_smile:

I was thinking that maybe you can leave the Java code like it is now, but make the transformer change the constructors on the fly. So you’d go from this:

public MappedObject() {}

public Matrix4f() {
	setIdentity();
}

to this:

public MappedObject(ByteBuffer buffer, int align, int sizeof) {
	init(buffer, align, sizeof);
}

public Matrix4f(ByteBuffer buffer, int align, int sizeof) {
	super(buffer, align, sizeof);
	setIdentity();
}

I just remembered another important request: Add a .copy(source) method, or .copy(source, target) if you want it static, that gets transformed to Unsafe’s copyMemory(source.viewAddress, target.viewAddress, MappedClass.SIZEOF). This allows uber-fast copies between mapped objects.
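
Roughly this, as a sketch (assuming ‘unsafe’ is an available sun.misc.Unsafe instance; Unsafe.copyMemory(long src, long dst, long bytes) is the real JDK method):

// hypothetical post-transform equivalent of source.copyTo(target)
unsafe.copyMemory(source.viewAddress, target.viewAddress, Matrix4f.SIZEOF);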

Hmm, is there any chance to support primitive arrays as mapped fields, or would that over-complicate things? They would obviously require a fixed size (via an annotation perhaps?).

Aye, view as a field would be great.

Why didn’t I think of this… :clue:

Sure. I’ll probably also add copy(MO, MO, count)

The problem there is that an array is simply an ‘anonymous’ reference on the stack. It’s quite hard to identify those. I can make assumptions on access patterns, but how would I identify this error:


byte[] arr = mapped.payload;
arr[0] = 4;
arr[1] = 3;

While this would be fine:


mapped.payload[0] = 4;
mapped.payload[1] = 3;

As there would be no byte[] at runtime, I’d have to push a pointer on the stack and use that in every subsequent array access. The problem is that I really can’t track that, as ‘arr’ can be passed into a method that also has regular byte[]s passed in. It gets much more complex than this relatively simple example.

Maybe you have an idea on how to solve this…?

Edit:
I could ‘solve’ it like this:


public class Packet extends MappedObject
{
    public int used;

    @MappedByteArray(length=32)
    public ByteBuffer payload;
}

It would be a real ByteBuffer instance, that has its private address-field modified by the ‘view’ field. To keep the GC from trippin’ I’d have to restore the address field before the ByteBuffer would be collected.

It’s a horrible hack, but it’d work.
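
For illustration, repointing the payload buffer could look something like this (a sketch; ‘unsafe’ and PAYLOAD_OFFSET, the byte offset of payload within the struct, are assumed; exception handling omitted):

// hypothetical: make packet.payload window the bytes at the current view
long addressOffset = unsafe.objectFieldOffset(java.nio.Buffer.class.getDeclaredField("address"));
unsafe.putLong(packet.payload, addressOffset, packet.viewAddress + PAYLOAD_OFFSET);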

Yes, that would be relatively clean, you won’t have to touch any code that uses the payload this way. Also… you could map the mapped buffer and get nested structs! ;D

Nope, mapped.viewAddress is independent of ByteBuffer.address, so once you modify the view field the nested mapped object would be detached.

Yes, obviously you’d have to remap every time you change the view index.

Btw, I’m not sure how you plan to allocate the mapped buffers, but it could potentially become expensive on e.g. lots of .dup(). Unless you always allocate 1 byte and hack Buffer.capacity as well. Since there’s no way to create a direct ByteBuffer anywhere in memory from Java (afaik), we use JNI’s NewDirectByteBuffer in LWJGL (see safeNewBuffer in common_tools.h). We use it on functions like GL15.glMapBuffer.

I wouldn’t allocate anything on dup(…) or on ‘embedding’ a nested buffer. I’d just call ByteBuffer.slice() or ByteBuffer.duplicate() on some global (static final) ByteBuffer; they have ‘private’ address & capacity fields, which you can manipulate to point the buffer at the desired memory region.
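
Something along these lines (a sketch; the address and capacity field names are the real java.nio.Buffer internals, everything else is assumed; exception handling omitted):

// hypothetical: carve an arbitrary memory range out of a shared dummy buffer
ByteBuffer region = GLOBAL_BUFFER.duplicate();
long addrOffset = unsafe.objectFieldOffset(java.nio.Buffer.class.getDeclaredField("address"));
long capOffset  = unsafe.objectFieldOffset(java.nio.Buffer.class.getDeclaredField("capacity"));
unsafe.putLong(region, addrOffset, desiredAddress); // repoint the duplicate
unsafe.putInt(region, capOffset, desiredCapacity);  // resize it
region.clear(); // position 0, limit = new capacity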

I feel like I didn’t quite get the question, because this seems too obvious.

BTW: direct-allocating 1 byte still malloc()s 4096 bytes (1 page) of memory.

Changes:

  1. the new way to manipulate the index is the ‘view’ field, probably renamed to ‘index’ (?)
  2. added support for malloc(count)
  3. added support for copyTo(target)
  4. added support for copyRange(target, instances)
  5. added support for the creation of arbitrarily ranged ByteBuffers
  6. major code cleanup
  7. added readme & license
  8. made dup(…) an instance method.

Bugfixes:

  1. copy the align & buffer fields on dup(…)
  2. made MappedObject.configure(…) stricter: view must be zero, or else ‘view’ can get corrupted as the stride is manipulated
  3. worked around an issue where mapped.view++ did a GETFIELD MO.view instead of GETFIELD V3.view, which resulted in fireworks

Todo:

  1. mapped buffers :point: (and thus loosely nested mapped objects)
  2. support user-defined constructors
  3. optional bounds-checks of ‘view’ field
  4. figure out what to do if copyTo(…) / copyRange(…) results in a memcpy with overlapping regions
  5. rewrite field byte-offset calculation code, to prevent misaligned fields and unions (overflowing ‘sizeof’)

Please keep in mind that mapped.view++ is currently a rather heavy operation:


int temp = (mapped.viewAddress - mapped.baseAddress) / mapped.stride;
temp++;
mapped.viewAddress = mapped.baseAddress + temp * mapped.stride;

I might, some day, optimize it to:

mapped.viewAddress += mapped.stride;

:yawn:

[quote=“Riven,post:58,topic:31992”]
OK, that’s “hackier” than I thought, but it should work fine.

[quote=“Riven,post:58,topic:31992”]
This should be fixed in JDK 7. There’s also a flag to force the old behavior (-XX:+PageAlignDirectMemory -Dsun.nio.PageAlignDirectMemory=true).