Minor optimisation for vertex arrays and a request

// x, y and z are computed per vertex (e.g. by the skinning code)
for (int i = 0; i < len; i++)
{
    vertexBuffer.put(x).put(y).put(z);
}

is significantly slower than


// fill a plain heap array first...
float[] temp = new float[len * 3];
for (int i = 0; i < len; i++)
{
    temp[i*3+0] = x;
    temp[i*3+1] = y;
    temp[i*3+2] = z;
}
// ...then hand it to the buffer in one bulk transfer
vertexBuffer.put(temp);

My test application renders a 3000 poly model at 90 fps with the first method, and 550 fps with the second. (I have to rebuild the buffers every time as I’m doing skinned animation)

In light of this, it would be nice to have a glVertexPointer method (and glColorPointer, and the other pointer methods) that takes a float[] and copies it to native memory behind the scenes.

Eh, nevermind, that won’t work. JOGL won’t know when it’s safe to release that buffer.

Carry on. :wink:

Oh well, I thought this was common knowledge :slight_smile:

I’ve had a wrapper class for this that only copies the data into a native buffer right before rendering, and skips the copy if nothing has changed.
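A minimal sketch of what such a wrapper might look like (the class name `LazyVertexBuffer` and its API are my invention, not the poster’s actual code): writes go to a plain float[], and the direct buffer is only refilled, with a single bulk put, when the renderer asks for it and the data has actually changed.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.FloatBuffer;

public class LazyVertexBuffer {
    private final float[] data;       // fast heap-side storage for writes
    private final FloatBuffer direct; // native-side buffer handed to GL
    private boolean dirty = true;     // true if data changed since last flush

    public LazyVertexBuffer(int floats) {
        this.data = new float[floats];
        this.direct = ByteBuffer.allocateDirect(floats * 4)
                                .order(ByteOrder.nativeOrder())
                                .asFloatBuffer();
    }

    public void set(int index, float value) {
        data[index] = value;
        dirty = true;
    }

    /** Called just before rendering; copies only if something changed. */
    public FloatBuffer forRendering() {
        if (dirty) {
            direct.clear();
            direct.put(data); // one bulk transfer instead of many small puts
            direct.flip();
            dirty = false;
        }
        return direct;
    }
}
```

The bulk put(float[]) is exactly the fast path from the benchmark above; the dirty flag makes unchanged frames free.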

With the 1.4 server-vm (or 1.6 client-vm for that matter) things really start to fly, as arrays get really fast.

I knew it was faster, but not… you know… five times faster. :wink:

I thought the put methods would get inlined.

They only get inlined effectively with the server VM, but even then it’s slow.

If you want to get really cranky, abuse the Unsafe class and do your own pointer arithmetic; it’s about 10% faster than field access (!!) on the server VM. You should search the forum for that topic.

* Riven shuts up already…

10% is just barely worth optimising for. 500% definitely is. :wink:

The reason I’ve never run into this before is that I never have to rebuild the vertices for each frame, so it’s never been a bottleneck. I’m going to go change a lot of code now. :smiley:

(also, my main reason for posting this was to request the not-so-thought-through addition of the new method. But that won’t work)

Use the absolute put() methods instead of the relative ones for better performance. See the slides from Sven Goethel’s and my JavaOne 2002 talk on the JOGL web page.
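To illustrate the difference (a sketch of my own, not code from the talk): relative put() calls read and update the buffer’s internal position field on every call, while absolute put(index, value) calls use plain local index arithmetic and leave the position untouched.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.FloatBuffer;

public class PutStyles {
    static FloatBuffer fill() {
        FloatBuffer fb = ByteBuffer.allocateDirect(6 * 4)
                                   .order(ByteOrder.nativeOrder())
                                   .asFloatBuffer();

        // Relative puts: each call reads and writes the buffer's
        // internal position field.
        fb.put(1f).put(2f).put(3f);

        // Absolute puts: the index is an ordinary local variable, so
        // there is no per-element position bookkeeping; the buffer's
        // position is left where the relative puts put it.
        fb.put(3, 4f);
        fb.put(4, 5f);
        fb.put(5, 6f);
        return fb;
    }
}
```
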

As an aside, please do not use sun.misc.Unsafe directly. You should be able to get all of the performance you need out of the NIO direct buffer classes. If you can’t, please file a bug.

As you have participated in this thread…
http://www.java-gaming.org/forums/index.php?topic=11112.msg88496#msg88496

Just one of my benchmarks: (running server vm 1.5)


Running benchmark with 2048 3d vecs...
math on Vec3[]:          66.4ms      30800 / sec <---
math on FloatBuffer:    299.4ms       6800 / sec
math on unsafe buffer:   58.9ms      34700 / sec <---
math on unsafe struct:  107.0ms      19100 / sec

299ms / 59ms = 5x faster, FloatBuffer is really slow, and Unsafe is even faster than field-access.

And I’m not going to file a bug, as all previous bug reports have been bluntly ignored so far. Not doing it again.

Could you please post complete source code for this benchmark or email it to me at kbr at dev.java.net?

If you don’t file a bug, you will have no right to complain. I would file bugs just so I can say to Sun later on, “I told you so.”

swpalmer: I’m not complaining :slight_smile: Just proving that a statement is wrong.

ken russell: you can find the sourcecode in the referenced thread

I don’t see an obvious link to a complete, compilable class or set of classes. Could you please either point me to it or generate it and attach it here?

Okay, it wasn’t really compilable, but I expected it to be enough :slight_smile:

No problem, I’ll make a set of classes tonight (GMT+1), as I’m at work now.

Ooh, statistics!

Hmm, unsafe buffers are 13% faster… that’s borderline worth optimising for. Definitely so if vertex transfer is a bottleneck.
I wish it were slightly more kosher so I’d dare to use it. sun.* packages are a big no-no.

To get any serious results, use the server VM:

Self-contained compilable sourcecode with the following results:

duration of last 8 runs:
---> arr:	264ms
---> buf:	1445ms
---> pnt:	240ms

arr = float[]-based
buf = FloatBuffer-based (~5.5x slower than float[])
pnt = pointer-based (~10% faster than float[])

Update:
It does a simple weighted blend of two data sources and stores the result in a third:

c = a*x + b*(1-x)

I’ve tried to optimize all three ways for best performance, by trial-and-error.

float[]

      // unrolling this loop makes it slower
      for (int i = 0; i < a.length; i++)
         c[i] = aMul * a[i] + bMul * b[i];

FloatBuffer

      while(fbA.hasRemaining())
      {
         fbC.put(aMul * fbA.get() + bMul * fbB.get());
         fbC.put(aMul * fbA.get() + bMul * fbB.get());
         fbC.put(aMul * fbA.get() + bMul * fbB.get());
      }

pointer-arithmetic

      for (int i = -4; i < bytes;)
      {
         unsafe.putFloat((i += 4) + c, unsafe.getFloat(i + a) * aMul + unsafe.getFloat(i + b) * bMul);
         unsafe.putFloat((i += 4) + c, unsafe.getFloat(i + a) * aMul + unsafe.getFloat(i + b) * bMul);
         unsafe.putFloat((i += 4) + c, unsafe.getFloat(i + a) * aMul + unsafe.getFloat(i + b) * bMul);
      }

Update 2:
Using pointer-arithmetic, cache misses kick in much, much later when using large data sets.


When processing   512 vertices   (18KB)    float[] gets 125M/s, pointers get 133M/s.
When processing  1024 vertices   (36KB)    float[] gets 125M/s, pointers get 133M/s.
When processing  2048 vertices   (72KB)    float[] gets 125M/s, pointers get 133M/s.
When processing  4096 vertices  (144KB)    float[] gets  92M/s, pointers get 133M/s. <--
When processing  8192 vertices  (288KB)    float[] gets  42M/s, pointers get 133M/s. <--
When processing 16384 vertices  (576KB)    float[] gets  42M/s, pointers get  42M/s.
When processing 32768 vertices (1152KB)    float[] gets  31M/s, pointers get  31M/s.

Makes you wonder what happens under the hood… :slight_smile:

You aren’t comparing apples to apples. Using the relative get() and put() methods on Buffers implies more work: the Buffer object’s position field has to be updated after every call. If you fix the benchmark to use the absolute get() and put() methods, as discussed in the JavaOne 2002 talk I referenced above, the buffer-based version is faster than the array-based version:


duration of last 8 runs:
---> arr:       649ms   51M vertices/s
---> buf:       582ms   57M vertices/s
---> pnt:       467ms   71M vertices/s

This is on a Pentium M 1.4 GHz with 5.0u6 and -server. Revised benchmark is attached.
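For illustration, here is what the absolute-index rewrite of the earlier relative-access blend loop might look like (a sketch under my own naming, not the attached benchmark itself):

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.FloatBuffer;

public class AbsoluteBlend {
    /** c[i] = a[i]*aMul + b[i]*bMul using absolute get/put only. */
    static void blend(FloatBuffer a, FloatBuffer b, FloatBuffer c,
                      float aMul, float bMul) {
        int n = c.capacity();
        for (int i = 0; i < n; i++) {
            // Absolute access: no position bookkeeping per element.
            c.put(i, aMul * a.get(i) + bMul * b.get(i));
        }
    }

    static FloatBuffer direct(int floats) {
        return ByteBuffer.allocateDirect(floats * 4)
                         .order(ByteOrder.nativeOrder())
                         .asFloatBuffer();
    }

    static FloatBuffer demo() {
        int n = 6;
        FloatBuffer a = direct(n), b = direct(n), c = direct(n);
        for (int i = 0; i < n; i++) {
            a.put(i, i);
            b.put(i, 2 * i);
        }
        blend(a, b, c, 0.25f, 0.75f); // c[i] = 0.25*i + 0.75*(2*i) = 1.75*i
        return c;
    }
}
```
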

Please DO NOT reference sun.misc.Unsafe directly in your classes, or at least in your products. By doing so you’re hiding potential performance issues with the public java.nio classes and only providing ammunition to parties within Sun who want to more severely restrict access to that class.

Sorry, I quickly read that post, and (stupidly) read it the other way around…

Do you mean making it accessible by that public-static method in my test-case, or just in the general case? (not referencing it anywhere == not using it)

[quote=“Ken Russell,post:16,topic:25994”]
Anyway, I think, once the problems are solved, access to Unsafe actually should be severely restricted.

After I patched my code with your snippet, I got the following results:

duration of last 8 runs:
---> arr:	265ms	126M vertices/s
---> buf:	347ms	96M vertices/s
---> pnt:	244ms	137M vertices/s

P4 2.4 @1.8GHz (533 @400MHz FSB)
512MB PC2700

The difference between buf and pnt is 42%. So there is still some room for improvement on the implementation of FloatBuffers on (at least) my hardware-configuration.

Okay, I removed the ByteBuffer->FloatBuffer conversion from the loop, which turned out to consume quite some CPU cycles.
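For readers wondering what that change looks like: `asFloatBuffer()` allocates a fresh view object on every call, so hoisting it out of the hot loop avoids that cost. A sketch (my own naming, not the uploaded benchmark):

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.FloatBuffer;

public class CachedView {
    static FloatBuffer fillCached() {
        ByteBuffer bb = ByteBuffer.allocateDirect(1024 * 4)
                                  .order(ByteOrder.nativeOrder());

        // Slow pattern (view created inside the hot loop):
        //   for (...) bb.asFloatBuffer().put(i, v); // new object per iteration
        //
        // Fast pattern: create the view once, outside the loop. The view
        // shares the ByteBuffer's native storage, so nothing is copied.
        FloatBuffer fb = bb.asFloatBuffer();
        for (int i = 0; i < 1024; i++) {
            fb.put(i, i * 0.5f);
        }
        return fb;
    }
}
```
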

Results:

duration of last 8 runs:
---> arr:	369ms	90M vertices/s
---> buf:	333ms	100M vertices/s
---> pnt:	246ms	136M vertices/s

I uploaded the last version.

After my update, your (Ken) and my percentages are about the same. If you apply the update too, you’ll have increased performance on buffers.

I’m very curious about that.

Look again at the numbers. The issue isn’t that the buffer case got significantly faster, but that the array case got significantly slower. I also see very high variation in the array case on my machine. I’m not sure whether that’s because of data placement or because of differences in the generated machine code.

In general I would avoid reading too much into the results of microbenchmarks. The take-home point here, in my opinion, is that direct buffers are not significantly slower than arrays, at least when used properly. There are also more optimization opportunities possible in the HotSpot JVM which we will investigate (such as making the earlier version of the benchmark using the relative get/put methods perform identically to the one using the absolute versions).

Regarding sun.misc.Unsafe, I mean not referencing it at all. I think it’s fine to do so when writing performance benchmarks like this one to try to prove or disprove a performance issue, but not when writing any sort of publicly released library or application.

Yes, I noticed the math on the float[] lost about one third of its performance, after changing the FloatBuffer code.

The JIT compiler is an impressive piece of art, apparently invalidating the results of quite solid microbenchmarks.

I was mainly pointing at the buffer/pointer difference, as the float[] was getting way off.
But well, without knowing why the float[] performance varies, the performance of buffer vs. pointer might be influenced by “unknown forces” too, obviously.

At least I now know the fastest code for all three ways of accessing data :slight_smile:

* Riven heads off to do important stuff… :wink: