FloatBuffer.put (int index, float f) expensive

I am working on a particle system using OpenGL on Android 2.1. To communicate with OpenGL, a FloatBuffer is used, allocated as follows:

buffer = ByteBuffer.allocateDirect(FLOAT_SIZE * size * 2).order(ByteOrder.nativeOrder()).asFloatBuffer();

and used like this:

buffer.put(index, f)

I have noticed that buffer.put() takes at least 10 times as long as assigning to an ordinary float array. This becomes a real bottleneck and the limiting factor for how many particles I can have.
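To illustrate, the two update paths I am comparing look roughly like this (just a sketch; count, positions and newValue() are stand-ins for my actual particle code):

// Plain array store - fast:
for (int i = 0; i < count; i++) {
    positions[i] = newValue(i);
}

// Indexed put into the direct FloatBuffer - at least 10x slower for me:
for (int i = 0; i < count; i++) {
    buffer.put(i, newValue(i));
}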

Has anyone noticed this problem or have any suggestions as to how to get around it?

Thanks,

Martin

Write everything to a float[] and use FloatBuffer.put(float[]) ?
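Something along these lines (just a sketch; staging is whatever scratch array you keep around between frames):

float[] staging = new float[size * 2];
// ... write all particle data into staging ...
buffer.position(0);
buffer.put(staging); // one bulk copy per frame instead of one put() per element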

Yes, tried that. It had no effect. So I found the source code and noticed that that method just iterates over the array and calls put(index, float) on each element.

Was wondering if there could be an alternative way to construct the FloatBuffer for OpenGL. Haven’t really been able to think one up though. From the source code it appears that what takes 10 times as long is various checks and function calls. Nothing much, but when applied 1000 times per frame in a particle system it really adds up.

Create one FloatBuffer and slice() it into 1000 buffers.

But ehm… why would you want 1000 buffers per frame? Can’t you store all particles in the same buffer?
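Something like this (a sketch; FLOATS_PER_PARTICLE and particleCount stand for whatever your system uses):

FloatBuffer all = ByteBuffer.allocateDirect(4 * FLOATS_PER_PARTICLE * particleCount)
        .order(ByteOrder.nativeOrder()).asFloatBuffer();
FloatBuffer[] slices = new FloatBuffer[particleCount];
for (int i = 0; i < particleCount; i++) {
    all.position(i * FLOATS_PER_PARTICLE);
    all.limit((i + 1) * FLOATS_PER_PARTICLE);
    slices[i] = all.slice(); // per-particle view into the same backing memory
}
all.clear(); // restore position/limit on the backing buffer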

No, I meant the put is done many times per frame. There is only one FloatBuffer. But as the particles move each frame, I have to update all positions in the FloatBuffer.

Well, you said ‘construct’.

According to:
http://apistudios.com/hosted/marzec/badlogic/wordpress/?p=478

Heap FloatBuffers have better performance. Upon copying to a VBO, the ‘driver’ seems to make its own (fast) copy.

This actually helps tremendously on Android. I’ve no idea why it doesn’t in your case… ???
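The pattern from that article is roughly this (a sketch only; gl is a GL11 instance, floatArray is your staging array, and whether a non-direct buffer is accepted here can depend on the binding/driver):

int[] ids = new int[1];
gl.glGenBuffers(1, ids, 0);
int vboId = ids[0];

FloatBuffer heapBuffer = FloatBuffer.wrap(floatArray); // heap buffer: filling the array is cheap
gl.glBindBuffer(GL11.GL_ARRAY_BUFFER, vboId);
gl.glBufferData(GL11.GL_ARRAY_BUFFER, floatArray.length * 4, heapBuffer, GL11.GL_DYNAMIC_DRAW); // driver makes its own copy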

Hmmm, it is probably hardware-specific which implementation of the put method you get. I am running on a Nexus One. And after putting in breakpoints I could see that the implementation of put(float[]) I got was one that just traversed the array and called put(float) on each element.

This wouldn’t be the first time a profiler alters the optimisation of an application.

It’d be safe to assume you ran it without a profiler too?

Yes, I wrote a small test app that times the different methods (one at a time, float[], and FloatBuffer.wrap(float[])). One at a time is fastest, then float[], and the slowest is passing it a wrapped FloatBuffer.

Why not upload this test somewhere so that we can benchmark on various platforms!? My experience is that putting one float[] is between 500 and 600% faster than single puts… at least on 1.5. Maybe they have worked on single puts in 2.1 or something.

BTW: Are you sure that you are measuring direct buffer performance? You can’t create a direct buffer by using wrap(…), can you?
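Quick way to check:

FloatBuffer wrapped = FloatBuffer.wrap(new float[16]);
System.out.println(wrapped.isDirect()); // false - wrap() always gives a heap buffer

FloatBuffer direct = ByteBuffer.allocateDirect(16 * 4)
        .order(ByteOrder.nativeOrder()).asFloatBuffer();
System.out.println(direct.isDirect()); // true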

Good idea Egon.

You can download source here:

http://games.martineriksen.net/PerformanceTest.zip

In the “bin” folder there is the APK which you can install using adb install. Then you can run the test by starting the PerformanceTest app on your phone. The test will output the data in LogCat - these are the interesting lines (seen on a Nexus One 2.1):

06-03 14:56:01.445: INFO/System.out(22956): time: 247.8s >> vertex buffer single puts
06-03 14:56:01.445: INFO/System.out(22956): time: 254.2s >> vertex buffer single puts with specified positions
06-03 14:56:01.445: INFO/System.out(22956): time: 264.3s >> vertex buffer full array puts
06-03 14:56:01.445: INFO/System.out(22956): time: 0.3s >> vertex buffer wrapping
06-03 14:56:01.445: INFO/System.out(22956): time: 285.3s >> wrapped array to vertex buffer

Here is the interesting code that runs this part of the test:



FloatBuffer nativeDirectFloatBuffer = OpenGlMemoryUtil.makeFloatBuffer(FLOAT_BUFFER_SIZE);
float[] floatArray = new float[FLOAT_BUFFER_SIZE];

for (int i = 0; i < FLOAT_BUFFER_SIZE; i++) {
    floatArray[i] = 0.5f;
}

time = print("Going VertexBuffers");
for (int i = 0; i < TESTSIZE_MILLIONTH; i++) {
    nativeDirectFloatBuffer.position(0);
    for (int k = 0; k < FLOAT_BUFFER_SIZE; k++) {
        nativeDirectFloatBuffer.put(0.5f);
    }
}
time = PerfLogUtil.logTime(time, "vertex buffer single puts", logindex++, TESTSIZE_THOUSANDS);

time = print("Going VertexBuffers");
for (int i = 0; i < TESTSIZE_MILLIONTH; i++) {
    for (int k = 0; k < FLOAT_BUFFER_SIZE; k++) {
        nativeDirectFloatBuffer.put(k, 0.5f);
    }
}
time = PerfLogUtil.logTime(time, "vertex buffer single puts with specified positions", logindex++, TESTSIZE_THOUSANDS);

time = print("Going VertexBuffers");
for (int i = 0; i < TESTSIZE_MILLIONTH; i++) {
    for (int k = 0; k < FLOAT_BUFFER_SIZE; k++) {
        floatArray[k] = 0.5f;
    }
    nativeDirectFloatBuffer.position(0);
    nativeDirectFloatBuffer.put(floatArray);
}
time = PerfLogUtil.logTime(time, "vertex buffer full array puts", logindex++, TESTSIZE_THOUSANDS);

FloatBuffer floatBufferWrappedArray = FloatBuffer.wrap(floatArray);
time = PerfLogUtil.checkPoint();
for (int i = 0; i < TESTSIZE_MILLIONTH; i++) {
    floatBufferWrappedArray = FloatBuffer.wrap(floatArray);
}
time = PerfLogUtil.logTime(time, "vertex buffer wrapping", logindex++, TESTSIZE_THOUSANDS);

for (int i = 0; i < TESTSIZE_MILLIONTH; i++) {
    for (int k = 0; k < FLOAT_BUFFER_SIZE; k++) {
        floatArray[k] = 0.5f;
    }
    nativeDirectFloatBuffer.position(0);
    floatBufferWrappedArray.position(0);
    nativeDirectFloatBuffer.put(floatBufferWrappedArray);
}
time = PerfLogUtil.logTime(time, "wrapped array to vertex buffer", logindex++, TESTSIZE_THOUSANDS);


Please put your code between [ code ] and [/ code ], otherwise [ i ] will be converted into italic-styled text.

Done :slight_smile: Thanks for the info Riven.

Tried it on my Samsung Galaxy with Android 1.5. The results are similar (but slower of course):


06-03 16:49:07.297: INFO/System.out(2166): time: 1137.7s >> vertex buffer single puts
06-03 16:49:07.297: INFO/System.out(2166): time: 1079.4s >> vertex buffer single puts with specified positions
06-03 16:49:07.297: INFO/System.out(2166): time: 1175.3s >> vertex buffer full array puts
06-03 16:49:07.297: INFO/System.out(2166): time: 1.4s >> vertex buffer wrapping
06-03 16:49:07.297: INFO/System.out(2166): time: 1220.6s >> wrapped array to vertex buffer

However, this changes once you add a variable instead of 0.5f, i.e. do something like this:


for (int i = 0; i < TESTSIZE_MILLIONTH; i++) {
	nativeDirectFloatBuffer.position(0);
	float val=0;
	for (int k = 0; k < FLOAT_BUFFER_SIZE; k++) {
		nativeDirectFloatBuffer.put(val);
		val+=0.1f;
	}			
}

This results in:


06-03 17:03:43.307: INFO/System.out(2782): time: 1387.5s >> vertex buffer single puts
06-03 17:03:43.307: INFO/System.out(2782): time: 1308.4s >> vertex buffer single puts with specified positions
06-03 17:03:43.307: INFO/System.out(2782): time: 1187.9s >> vertex buffer full array puts
06-03 17:03:43.317: INFO/System.out(2782): time: 1.4s >> vertex buffer wrapping
06-03 17:03:43.317: INFO/System.out(2782): time: 1192.8s >> wrapped array to vertex buffer

I’m still not sure why it helped that much more in my code (which is a bit more complex than this simple benchmark, of course) to go with float[]s… ??? Dalvik is strange… slow and strange…

I’ve reverted my own stuff to use single puts to see what happens then… I have a loop with 6 "puts" in each iteration filling two different buffers. With float[] instead of single puts, this is 3 times faster on my device.

OK - is your code in a format that you can send so I can try and replicate your test run?

Also, after instrumenting and further analysis, I found that TraceView had exaggerated the cost of the buffer puts in relation to the whole programme execution. TraceView said that the buffer puts were 17% of all time spent, whereas actually they are only 3%. So, as Riven suggested, the profiler was not quite truthful. It is still the case that the puts take 10 times longer than array puts, but the overall impact is lower than I thought.

Still it would be great to find a way to write the buffers faster.

/Martin

Sure. Code looks like this:


int ix=0;
for (c = 0; c < endII; c++) {
	vcoords[ix] = x[c];
	ncoords[ix++] = nx[c];
	vcoords[ix] = y[c];
	ncoords[ix++] = ny[c];
	vcoords[ix] = z[c];
	ncoords[ix++] = nz[c];
}

...

vertices.put(vcoords);
normals.put(ncoords);


Single put code looks the same except that in the loop I’m doing 3 puts into vertices and 3 into normals instead of filling the array.

I’ve just run into the bulk-put problem, and have noticed that IntBuffers do not suffer the same fate - bulk put( int[] ) calls are very quick. I’m seeing a 10x speedup using this for 10000-element arrays, and about 2x for 10 elements.


	private static int[] intArray = new int[ 0 ];

	/**
	 * Work-around for crappy {@link FloatBuffer#put(float[])}
	 * performance
	 * 
	 * @param buff destination buffer, written from its current position
	 * @param data the float values to write
	 */
	public static void put( IntBuffer buff, float[] data )
	{
		if( intArray.length < data.length )
		{
			intArray = new int[ data.length ];
		}

		for( int i = 0; i < data.length; i++ )
		{
			intArray[ i ] = Float.floatToIntBits( data[ i ] );
		}

		buff.put( intArray, 0, data.length );
	}
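
For completeness, a usage sketch (FLOAT_SIZE and size are taken from the first post; the other names are made up): create an IntBuffer view of the same direct ByteBuffer and fill it through the helper. The bit patterns are identical, so OpenGL still sees the floats you wrote.

ByteBuffer bytes = ByteBuffer.allocateDirect(FLOAT_SIZE * size * 2).order(ByteOrder.nativeOrder());
FloatBuffer floatView = bytes.asFloatBuffer(); // hand this to glVertexPointer() as before
IntBuffer intView = bytes.asIntBuffer();       // fill through this view

float[] positions = new float[size * 2];
// ... update positions each frame ...
intView.clear();
put(intView, positions); // fast bulk int put; same bits as the floats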