Why mapped objects (a.k.a. structs) aren't the ultimate (performance) solution

People here seem to think that mapped objects will increase the performance of their application. In this thread I want to show you an example of how to use standard Java classes for a decent speed-up, which wouldn’t be possible with mapped objects (at least I guess so).

NOTE: This is NO offense to anybody. I just think people are focusing too much on mapped objects. Please prove me wrong on anything I write, that’s what the discussion forum is about!

First, I want to ask you to read my personal definitions of structs and mapped objects that I posted on this topic, just so we have a common base. Sorry for the cross-linking :slight_smile:

Ok, here we go.

I wanted to pick a real-world example for games, so I thought the vertices of a triangle mesh might be a good one, since you have to put them into a Buffer in order to send them to the graphics card.

Let’s say a vertex consists of a position, a normal, a tangent and a 2D texture coordinate, so we can do normal/parallax mapping - I assume the binormal is generated in the shader. Further, there has to be a performance benefit from using mapped objects, so let’s make the data dynamic, because we do something like software skinning. However, the texture coordinates are static, as they should be for most types of applications.

First, I’ll try to imagine how a Vertex class may look as a MappedObject.

In order to keep it simple I just use fields and assume accessing them manipulates or reads from a buffer.
(Since there are lots of different possible implementations described on this board, please comment on how to change the following.)

A pure and simple approach would be:


class Vertex extends MappedObject {
    float posX, posY, posZ;
    float normX, normY, normZ;
    float tanX, tanY, tanZ;
    float texX, texY;
}

but I guess it should be possible to make something like this:


class Vector3 extends MappedObject {
    float x,y,z;
}

class Vector2 extends MappedObject {
    float x,y;
}

class Vertex extends MappedObject {
    Vector3 pos, norm, tan;
    Vector2 tex;
}

This will end up as either an array of those mapped objects:


Vertex[] vertices;

or, like Riven proposed, in a StructBuffer:


Vertex vertices = new Vertex(buf);

This is where, IMHO, the first problem occurs, since on the application side a vertex is a logical unit, but for the graphics card you have to split up static and dynamic data. Well, you don’t have to, but I’m sure everyone will agree that sending static data (here the texture coordinates) over the bus every single frame really decreases performance!


class DynamicVertex extends MappedObject {
    Vector3 pos, norm, tan;
}

DynamicVertex[] dynamicVerts;
Vector2[] staticVerts;


OK, it should be possible to create both from the same underlying ByteBuffer the vertices are mapped to, but then your dynamic data isn’t packed tightly anymore. Although it is possible to adjust the stride parameter, e.g. for glVertexPointer, OpenGL performs much better with tight data. However, the main problem is that you have to send the whole buffer down to the graphics card. The only proper solution is to have two Buffers, one for the static and one for the dynamic data.
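To make the two-buffer idea concrete, here is a minimal sketch with plain NIO buffers (all class and method names are mine, invented for illustration): the static texture coordinates are written once at load time, while the tightly packed dynamic data is rewritten every frame after skinning.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.FloatBuffer;

// Hypothetical split: static texcoords go into one buffer (filled once),
// dynamic position/normal/tangent data into a second, tightly packed
// buffer that is refilled every frame before being handed to OpenGL.
class SplitVertexData {
    static final int DYNAMIC_FLOATS = 9; // pos(3) + norm(3) + tan(3)
    static final int STATIC_FLOATS  = 2; // tex(2)

    final FloatBuffer staticBuf;   // texture coordinates, written once
    final FloatBuffer dynamicBuf;  // skinned data, rewritten per frame

    SplitVertexData(int vertexCount) {
        staticBuf  = newNativeFloatBuffer(vertexCount * STATIC_FLOATS);
        dynamicBuf = newNativeFloatBuffer(vertexCount * DYNAMIC_FLOATS);
    }

    static FloatBuffer newNativeFloatBuffer(int floats) {
        return ByteBuffer.allocateDirect(floats * 4)
                         .order(ByteOrder.nativeOrder())
                         .asFloatBuffer();
    }

    // called once at load time
    void putTexCoord(int i, float u, float v) {
        staticBuf.put(i * STATIC_FLOATS, u).put(i * STATIC_FLOATS + 1, v);
    }

    // called every frame after skinning
    void putDynamic(int i, float[] posNormTan) {
        for (int k = 0; k < DYNAMIC_FLOATS; k++)
            dynamicBuf.put(i * DYNAMIC_FLOATS + k, posNormTan[k]);
    }
}
```

With this layout each buffer stays tight (stride 0), and only `dynamicBuf` travels over the bus per frame - which is exactly the split a single mapped Vertex class can’t express.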

Now my question is: can an object be mapped to 2 different buffers?
I guess not, at least not while maintaining their promised performance.

Summing up: from my understanding there is no way to have 2 different Buffers (static and dynamic) and an object mapped to both - here, the Vertex class with its positions, normals and tangents mapped to the first buffer and the texture coordinates mapped to the second. IMHO, not having a single Vertex class is somehow ugly code.

So far, I focused on a possible limitation of mapped objects; from now on I’ll try to explain how Java’s standard classes (reference types) can be used to increase performance.

What astonishes me most is that Java guys still think in a C/C++ manner. I know a lot of people complain that classes in Java, in contrast to C#, can only be reference types.
Compared to C/C++, an array of a class is an array of pointers:


Vector3[] vecs = new Vector3[size];   // Java
Vector3** pVecs = new Vector3*[size]; // C++

Actually, this brought me to an idea, how to optimise my mesh class:

As the graphics guys among you know, some of the vertex data depends on face information, like the normals (smoothing groups/hard edges) or materials (different textures/colors). Therefore you have to split a vertex whenever two neighbouring faces have either

  • a hard edge, resulting in different normals for the same position
  • different materials, resulting in different texture coordinates for the same position

Since the tangent depends on both the normals and the texture coordinates, it will be different for the same position if one of the above cases is true.

Most implementations, however, don’t use the information that the positions of the duplicated vertices are the same. The same is true for the normals split by different materials.

Therefore, I use a representation like this:


class Vertex {
    Vector3 pos, norm, tan;
    Vector2 tex;
}

Vector3[] positions;
Vector3[] normals;
Vector2[] texcoords;
Vector3[] tangents;


Vertex[] vertices;

Since all of these are reference types (arrays of pointers), they point to the same data (e.g., positions[i] == vertices[j].pos is true for all vertices j that were duplicated according to the face materials and hard edges).

All vector arrays are duplicate-free, which saves you from doing multiple modifications. This means you save modifications of the positions whenever faces that refer to a vertex referencing this position have different face materials or hard edges. Same for the normals…

Of course the benefit depends on the data; actually there is none if a mesh only has a single material and no hard edges. For my models, position transformations usually reduce to ~2/3.
You say saving 1/3 isn’t much? Keep in mind that, for example, software skinning usually transforms a position 1-4 times. Further, there might be other modifications as well (morph targets used to simulate muscle contraction, …). With all that you can speed up modification by, say, about a factor of 2 (double the number of FPS, if this is your bottleneck :)), which isn’t bad IMHO.
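The sharing described above can be sketched with plain Java classes (the names are mine, and everything but the positions is omitted for brevity): split vertices reference the same Vector3 instance, so each unique position is transformed exactly once and every vertex sees the update through its reference.

```java
// Sketch of the sharing technique: split vertices alias the same Vector3
// instances, so transforming the duplicate-free positions array once
// updates every vertex that shares that position.
class Vector3 {
    float x, y, z;
    Vector3(float x, float y, float z) { this.x = x; this.y = y; this.z = z; }
}

class Vertex {
    Vector3 pos; // norm, tan, tex omitted for brevity
    Vertex(Vector3 pos) { this.pos = pos; }
}

class SharedMesh {
    Vector3[] positions; // duplicate-free
    Vertex[] vertices;   // may contain splits referencing the same position

    // e.g. part of software skinning: each unique position is touched once
    void translateAll(float dx, float dy, float dz) {
        for (Vector3 p : positions) { p.x += dx; p.y += dy; p.z += dz; }
        // no per-vertex work needed: vertices[j].pos aliases positions[i]
    }
}
```

A mapped object can’t do this, because its “fields” are fixed offsets into a buffer rather than references that several vertices can share.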

Please tell me if this technique would be possible with mapped objects. If not, I fear losing my 2x speed-up just to remove the IMHO not so remarkable copy operation to the buffer.

This brings me to my conclusion, which is about the bottleneck. I’ll never complain that copying values to a buffer hurts the performance of my application, as long as I’m not sure whether there are other possible optimizations with a greater impact.

Looking forward to your opinions :slight_smile:

Count your objects… zillions. (the whole point of structs)

Invoking a method 6 times, that you made 33% faster, will still result in a 33% performance increase, not factor 2.

In my previous thread I showed that accessing an object-array with tiny-objects is very slow (the objects are spread all over the heap! cache misses!)

My advice: do some benchmarks before suggesting such alternatives.

As I told you in the other thread, the classes in the second half are pure standard Java classes, so there should be NO speed issues with that.

OK, in your microbenchmark you stated that accessing fields of a Java class is slower than your struct implementation, but I doubt that


Vector3[] vecs = ... ; // ordinary array of a standard Java class
vecs[i].x += 1.0f;

can’t compete with:


Vector3 vec = new Vector3(buffer);
vec.position(i);
vec.x(vec.x() + value);

and if so, Java has a problem IMHO :slight_smile:
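To make the comparison concrete, here is a runnable sketch of both access styles; the flyweight accessors (`position()`/`x()`/`x(float)`) are my guess at the API being discussed, not an actual implementation.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.FloatBuffer;

// Style 1: ordinary Java objects, accessed through an array of references.
class PlainVec3 {
    float x, y, z;
}

// Style 2: a flyweight "window" over a buffer, repositioned per element.
class FlyweightVec3 {
    private final FloatBuffer buf;
    private int base; // float index of the current element
    FlyweightVec3(ByteBuffer bb) {
        this.buf = bb.order(ByteOrder.nativeOrder()).asFloatBuffer();
    }
    void position(int i) { base = i * 3; }
    float x()            { return buf.get(base); }
    void x(float v)      { buf.put(base, v); }
}
```

Both compute the same thing; which is faster depends on where the objects land on the heap versus the sequential layout of the buffer - which is exactly the cache-miss argument Riven makes.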

Further, I benchmarked the second technique for the 2x speed-up (in a real app, not some micro-benchmark).
Again, note this was NOT a comparison between struct code and non-struct code, just between saving modifications or not! I only asked whether this optimisation would be possible with structs/mapped objects.

Then Java has a problem (…) This is caused by the random-access-style of fetching these objects from RAM, which are all over the place. With a buffer it’s pretty much a sequential read. So don’t blame Java, it occurs in all languages.

Check your topic.

The first half deals with dividing mapped objects into static and dynamic data, which seems problematic to me. The second half shows an optimization technique which I doubt to be possible with mapped objects.

so what’s wrong with the topic?

I think you should try and work through the actual memory accesses to understand the issue first.

Cas :slight_smile:

hey Cas, I’m happy that you noticed this thread, since AFAIK you’re one of the top promoters of mapped objects (if not their inventor?) :slight_smile:

Well, short answer: which memory accesses are you thinking of, accessing Java fields or putting data into buffers?
I only use standard field access, and this is by all means fast enough for me. The only exception is putting the dynamic data into a buffer once per frame. Are you talking about this kind of memory access? If so, please tell me why simply putting data into a buffer can have a great impact on performance.

One important thing that mapping to ByteBuffers deals with is how it interfaces to I/O … both I/O to native code and network or disk structures. Being able to set the Byte Order is important… being able to map to Direct Byte Buffers so the C code can have efficient access to the data is also significant.

thanks for the reply, swpalmer.

Maybe I got something wrong, but are you explaining why buffers in general are needed? I never doubted that; I just argued that mapped objects, which fetch their data directly from a buffer, aren’t that great IMHO.

I’m not against an automatic, efficient put/get mechanism from an object (struct) to a buffer. On the contrary, that would be nice, because some bounds checking could be eliminated and it would make the code somewhat cleaner.
On the other hand, still nobody has told me about a situation in a real app where copying the data to a buffer would be the bottleneck, so there should be no hurry, even for this type of optimization.

[quote=“zero,post:9,topic:26597”]
I wrote a GIS mapping engine from scratch. To make it perform fast, I stored the mapping data directly in memory. We’re talking about hundreds of millions of objects worth of data. First of all, I can’t even hold those many objects in a 32-bit JVM, because their headers alone, would cost me a gigabyte of memory. Second, copying that data around just to use it would be pure insanity. Third, having that many objects in memory thrashes the gc every time it goes to do a full scan.

So, to get around these problems, I load the data into DirectByteBuffers which I wrap with flyweight classes. I think structs is a terrible name for this. It’d be better off being called FlyweightView or some such thing. The whole point around this, is that you’re avoiding the creation of huge numbers of stateful objects. Flyweight is a well known and understood name for this.

If I weren’t bothered with a bajillion other things right now, I would just write the bytecode transformations for it so there would be an end to the discussion. /sigh
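In the spirit of what rreyelts describes, a "FlyweightView" might look like the following sketch: one reusable window over a DirectByteBuffer holding millions of records, instead of millions of heap objects with per-object headers. The record layout (an int id plus two doubles) is invented for illustration, not taken from his engine.

```java
import java.nio.ByteBuffer;

// Hypothetical FlyweightView over packed records in a (direct) ByteBuffer.
// Repositioning the view is just an index computation; no objects are
// allocated per record, so the GC never sees the data set.
class PointView {
    static final int RECORD_SIZE = 4 + 8 + 8; // int id, double lat, double lon
    private final ByteBuffer buf;
    private int base;

    PointView(ByteBuffer buf) { this.buf = buf; }

    PointView at(int record) { base = record * RECORD_SIZE; return this; }

    int id()     { return buf.getInt(base); }
    double lat() { return buf.getDouble(base + 4); }
    double lon() { return buf.getDouble(base + 12); }

    void set(int id, double lat, double lon) {
        buf.putInt(base, id).putDouble(base + 4, lat).putDouble(base + 12, lon);
    }
}
```

One such view (per thread) serves the whole data set, which is the flyweight pattern: the shared state lives in the buffer, the view only holds a cursor.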

[quote=“swpalmer,post:8,topic:26597”]
Specifically for interacting with huge gobs of C++ data. Passing the data through a DirectByteBuffer with a strongly typed flyweight facade means great speed with very little abstraction penalty.

There, he said it quite succinctly :slight_smile:

Cas :slight_smile:

[quote=“rreyelts,post:10,topic:26597”]

First of all sorry for the late reply, I missed the post updates…

Can you please tell me a bit about the structure of your hundreds of millions of objects, because it’s difficult to answer properly without knowing them. Further, I would like to know how you manipulate the data before sending it to the graphics card, because this is obviously the point where some kind of Java ‘object’ representation (in your case: FlyweightView) comes into play. Finally, please tell me the commands (OpenGL) you use for updating the geometry.

I’m pretty sure that with this information I can explain my point of view more easily.

[quote=“rreyelts,post:11,topic:26597”]

I was wondering why people seem to think that I ever doubted the performance benefits of using DirectBuffers. What really puzzles me is the following:

Say I have a direct buffer whose native ordering is little endian, and you wrap a flyweight object (sliding window or whatever) around it. Let’s say this has x, y, z properties for a 3D vector. Now you’re performing a certain operation, e.g. a scalar multiplication, on all your objects. But since the scalar is a standard Java type (big endian), the two factors have different endianness. Now my question is: can the Java VM calculate with them efficiently?


XXX vector = new XXX(buffer);
float scalar = 0.4f;
for(int i=0; i<1000000; i++) {
  vector.position(i);
  vector.x(vector.x()*scalar);
  vector.y(vector.y()*scalar);
  vector.z(vector.z()*scalar);
}

Doing anything performance-related on non-native ordered DirectBuffers is idiotic anyway. So the problem just doesn’t occur.

I’m talking about native buffers exclusively.

Direct == native :-X

Riven is correct; if you expect high performance on wrong-endian data you’ve got an error in brain space 0 and need to reevaluate your career choice :slight_smile:
High performance btw. is only half the story; the other half is a clean and uncluttered (ie. not error prone) way of accessing legacy data structures directly with as little effort as possible.

Cas :slight_smile:

dammit! Am I so hard to understand?

When allocating a direct buffer, or retrieving one, e.g. using glMapBuffer, it is ordered by ‘ByteOrder.nativeOrder()’:


ByteBuffer bb = ByteBuffer.allocateDirect(numElements * SIZE_OF_STRUCTURE);
bb.order(ByteOrder.nativeOrder());

In case of intel CPUs, this means LITTLE_ENDIAN, but Java types are all BIG_ENDIAN.

As a result your mapped object, sliding window or whatever will have a LITTLE_ENDIAN-ordered buffer as its data backing. Further, computations involving Java’s primitive types (BIG_ENDIAN) result in a mix of both endians. So I asked whether the Java VM can handle this performantly.

[quote]In case of intel CPUs, this means LITTLE_ENDIAN, but Java types are all BIG_ENDIAN.
[/quote]
Java “the language” is big endian - Java “the virtual machine” is not.

That is to say, Java types look and feel as if they are big endian to you (the programmer), but they are stored in your machine’s memory in the native byte order of your machine. If that were not the case, I couldn’t even begin to imagine the kinds of performance hits Java would suffer when performing even the simplest of operations.

Language features (like bitwise-operators) that make Java feel like it is big endian, are just there to help you write portable code.
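A small sketch of that point (class and method names are mine): the same float written through a LITTLE_ENDIAN and a BIG_ENDIAN buffer produces reversed bytes, yet reads back as the identical Java value either way - the language semantics hide the storage order from you.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// The same float written through differently ordered buffers: the raw
// bytes come out reversed, but getFloat() returns the identical Java
// value either way -- the byte order only matters at the buffer boundary.
class EndianDemo {
    static byte[] bytesOf(float f, ByteOrder order) {
        ByteBuffer bb = ByteBuffer.allocate(4).order(order);
        bb.putFloat(0, f);           // absolute put, position stays at 0
        byte[] out = new byte[4];
        bb.get(out);                 // copy the 4 raw bytes
        return out;
    }

    static float roundTrip(float f, ByteOrder order) {
        ByteBuffer bb = ByteBuffer.allocate(4).order(order);
        bb.putFloat(0, f);
        return bb.getFloat(0);       // no conversion visible to the caller
    }
}
```

So mixing a buffer-backed value with a Java local never mixes endianness in the arithmetic itself; any byte swap happens (if at all) during the get/put, and with a native-ordered buffer it doesn’t happen at all.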