Fastest way of using FloatBuffers

So I’m not hijacking a different thread…

I’ve been asking Riven about the fastest ways of using FloatBuffers after having some real performance issues when using them for a lot of values. I made a simple profiler and got really weird results.


Generating array of 524288 values to use...
    Done.
Copying all values into a new array, then bulk putting (using nonexisting array and FloatBuffer)...
    Took 0.011403 seconds.
Copying all values into a new array, then bulk putting (using existing array and FloatBuffer)...
    Took 0.015468 seconds.
Using index-based puts (using nonexisting FloatBuffer)...
    Took 0.006779 seconds.
Using individual puts (using nonexisting FloatBuffer)...
    Took 0.006532 seconds.

This looks like the opposite of what I’d expect. Individual puts are really fast and bulk puts are quite slow. This is Java 1.6.

Okay wait. I just ran it again with more values and the timing is all over the map.


Generating array of 8388608 values to use...
    Done.
Copying all values into a new array, then bulk putting (using nonexisting array and FloatBuffer)...
    Took 0.119186 seconds.
Copying all values into a new array, then bulk putting (using existing array and FloatBuffer)...
    Took 0.035243 seconds.
Using index-based puts (using nonexisting FloatBuffer)...
    Took 0.074629 seconds.
Using individual puts (using nonexisting FloatBuffer)...
    Took 0.058328 seconds.

Here is the really quick and dirty code. Is it a bad idea to use nanoTime? Am I not warming it up correctly?


import java.nio.FloatBuffer;

public class FloatBufferProfiler
{
	public static void main(String[] args)
	{
		final int ARRAY_SIZE = 8388608;
		final double NANO = 1000000000.0;
		
		System.out.println("Generating array of " + ARRAY_SIZE + " values to use...");
		
		float[] valueArray = new float[ARRAY_SIZE];
		for (int i = 0; i < valueArray.length; i++)
		{
			valueArray[i] = (float) (Math.random() * 100000);
		}
		
		System.out.println("    Done.");
		
		//Warm up the clock.
		long timePoll = System.nanoTime();
		timePoll = System.nanoTime();
		timePoll = System.nanoTime();
		timePoll = System.nanoTime();
		timePoll = System.nanoTime();
		timePoll = System.nanoTime();
		
		System.out.println("Copying all values into a new array, then bulk putting (using nonexisting array and FloatBuffer)...");
		float[] newValues = new float[ARRAY_SIZE];
		FloatBuffer buffer = FloatBuffer.allocate(ARRAY_SIZE);
		for (int i = 0; i < valueArray.length; i++)
		{
			newValues[i] = valueArray[i];
		}
		buffer.put(newValues);
		buffer.flip();
		System.out.println("    Took " + ((System.nanoTime() - timePoll) / NANO) + " seconds.");
		
		System.out.println("Copying all values into a new array, then bulk putting (using existing array and FloatBuffer)...");
		timePoll = System.nanoTime();
		buffer.clear();
		for (int i = 0; i < valueArray.length; i++)
		{
			newValues[i] = valueArray[i];
		}
		buffer.put(newValues);
		buffer.flip();
		System.out.println("    Took " + ((System.nanoTime() - timePoll) / NANO) + " seconds.");
		
		System.out.println("Using index-based puts (using nonexisting FloatBuffer)...");
		buffer.clear();
		buffer = null;
		timePoll = System.nanoTime();
		buffer = FloatBuffer.allocate(ARRAY_SIZE);
		for (int i = 0; i < valueArray.length; i++)
		{
			buffer.put(i, valueArray[i]);
		}
		buffer.flip();
		System.out.println("    Took " + ((System.nanoTime() - timePoll) / NANO) + " seconds.");
		
		System.out.println("Using individual puts (using nonexisting FloatBuffer)...");
		buffer.clear();
		buffer = null;
		timePoll = System.nanoTime();
		buffer = FloatBuffer.allocate(ARRAY_SIZE);
		for (int i = 0; i < valueArray.length; i++)
		{
			buffer.put(valueArray[i]);
		}
		buffer.flip();
		System.out.println("    Took " + ((System.nanoTime() - timePoll) / NANO) + " seconds.");
	}
}

The four situations I was testing are:

  • A new array and buffer are made every draw pass. All the values are copied over into the array one by one (simulating what you would be doing making draw calls from all over your code), then bulk copied to the buffer.
  • An existing array and buffer have already been made, so they are in memory. The buffer is cleared, then the above operation happens (basically the same as above, but with a clear() instead of making a new buffer and array).
  • No array is used, a new buffer is made every draw pass. A bunch of individual puts happen, each one using the index value.
  • No array is used, a new buffer is made every draw pass. A bunch of individual puts happen, with no index value.

This seems to vary wildly based on the number of items I’m putting in. Wha?

Are you using the server VM?


$ java -version
java version "1.6.0_22"
Java(TM) SE Runtime Environment (build 1.6.0_22-b04-307-10M3261)
Java HotSpot(TM) 64-Bit Server VM (build 17.1-b03-307, mixed mode)

Yes. :slight_smile: I believe this is now the default in Mac OS X. There is no longer a client VM.

JDK 7 changes the game completely again as well :slight_smile: I will do a bit of benchmarking on my machine later and see what the differences are.

Hurrah for microbenchmarks!

Cas :slight_smile:

Interesting, but please try to get average numbers from several runs (e.g. 100).

I’m more likely to do about 1,000,000 runs…

Cas :slight_smile:

Here’s an updated test case that averages across multiple runs, and also adds a case where you do index puts into an existing buffer. Additionally, it uses direct float buffers since that’s what OpenGL libraries require.


import java.nio.FloatBuffer;
import java.nio.ByteBuffer;

public class FloatBufferProfiler
{
	public static void main(String[] args)
	{
		final int ARRAY_SIZE = 8388608;
		final int numTests = 100;
		final double NANO = 1000000000.0 * numTests;

		
		System.out.println("Generating array of " + ARRAY_SIZE + " values to use...");
		
		float[] valueArray = new float[ARRAY_SIZE];
		for (int i = 0; i < valueArray.length; i++)
		{
			valueArray[i] = (float) (Math.random() * 100000);
		}
		
		System.out.println("    Done.");
		
		//Warm up the clock.
		long timePoll = System.nanoTime();
		timePoll = System.nanoTime();
		timePoll = System.nanoTime();
		timePoll = System.nanoTime();
		timePoll = System.nanoTime();
		timePoll = System.nanoTime();
		
		{
			System.out.println("Copying all values into a new array, then bulk putting (using nonexisting array and FloatBuffer)...");
			for (int u = 0; u < numTests; u++) {
				float[] newValues = new float[ARRAY_SIZE];
				FloatBuffer buffer = ByteBuffer.allocateDirect(ARRAY_SIZE * 4).asFloatBuffer();
				for (int i = 0; i < valueArray.length; i++)
				{
					newValues[i] = valueArray[i];
				}
				buffer.put(newValues);
				buffer.flip();
			}
			System.out.println("    Took " + ((System.nanoTime() - timePoll) / NANO) + " seconds.");
		}

		{
			System.out.println("Copying all values into a new array, then bulk putting (using existing array and FloatBuffer)...");
			float[] newValues = new float[ARRAY_SIZE];
			FloatBuffer buffer = ByteBuffer.allocateDirect(ARRAY_SIZE * 4).asFloatBuffer();
			timePoll = System.nanoTime();
			for (int u = 0; u < numTests; u++) {
				buffer.clear();
				for (int i = 0; i < valueArray.length; i++)
				{
					newValues[i] = valueArray[i];
				}	
				buffer.put(newValues);
				buffer.flip();
			}
			System.out.println("    Took " + ((System.nanoTime() - timePoll) / NANO) + " seconds.");
		}
		
		{
			System.out.println("Using index-based puts (using nonexisting FloatBuffer)...");
			timePoll = System.nanoTime();
			for (int u = 0 ; u < numTests; u++) {
				FloatBuffer buffer = ByteBuffer.allocateDirect(ARRAY_SIZE * 4).asFloatBuffer();
				for (int i = 0; i < valueArray.length; i++)
				{
					buffer.put(i, valueArray[i]);
				}
				buffer.flip();
			}
			System.out.println("    Took " + ((System.nanoTime() - timePoll) / NANO) + " seconds.");
		}

		{
			System.out.println("Using individual puts (using nonexisting FloatBuffer)...");
			timePoll = System.nanoTime();
			for (int u = 0; u < numTests; u++) {
				FloatBuffer buffer = ByteBuffer.allocateDirect(ARRAY_SIZE * 4).asFloatBuffer();
				for (int i = 0; i < valueArray.length; i++)
				{
					buffer.put(valueArray[i]);
				}
				buffer.flip();
			}
			System.out.println("    Took " + ((System.nanoTime() - timePoll) / NANO) + " seconds.");
		}

		{
			System.out.println("Using index-based puts (using existing FloatBuffer)...");
			// Allocate before starting the timer so only the puts are measured.
			FloatBuffer buffer = ByteBuffer.allocateDirect(ARRAY_SIZE * 4).asFloatBuffer();
			timePoll = System.nanoTime();
			for (int u = 0 ; u < numTests; u++) {
				buffer.clear();
				for (int i = 0; i < valueArray.length; i++)
				{
					buffer.put(i, valueArray[i]);
				}
				buffer.flip();
			}
			System.out.println("    Took " + ((System.nanoTime() - timePoll) / NANO) + " seconds.");
		}
	}
}

My results (also on a Mac with Java 6) are:



Generating array of 8388608 values to use...
    Done.
Copying all values into a new array, then bulk putting (using nonexisting array and FloatBuffer)...
    Took 0.06889852 seconds.
Copying all values into a new array, then bulk putting (using existing array and FloatBuffer)...
    Took 0.03462399 seconds.
Using index-based puts (using nonexisting FloatBuffer)...
    Took 0.10300789 seconds.
Using individual puts (using nonexisting FloatBuffer)...
    Took 0.11679034 seconds.
Using index-based puts (using existing FloatBuffer)...
    Took 0.05540613 seconds.

Some ideas for better micro-benchmarking:

  1. Try to use Caliper or at least extract each test in its own method and add some warm-up runs.
  2. Use a much smaller ARRAY_SIZE, something that’s representative of fp data in a game. Increase the number of test runs accordingly.
  3. For the buffer put loops, use buffer.limit() (or .remaining()) instead of valueArray.length. In the array loops, you’re making it easy for the VM to remove array bounds checks, but you aren’t doing the same in the buffer loops.
  4. Add tests that use random indices (instead of going from 0 to array/buffer length). Would be interesting to see the differences then.
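
A rough sketch of points 1 and 3, without Caliper: the put loop lives in its own method, gets some untimed warm-up runs, and is bounded by the buffer's own remaining() rather than the array length. The names PutLoopSketch, fill and averageNanos are made up for illustration, and the run counts in main are arbitrary.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.FloatBuffer;

final class PutLoopSketch {

    // Fills dst with relative puts. The loop bound comes from the buffer
    // itself (remaining()), not from src.length.
    static int fill(FloatBuffer dst, float[] src) {
        dst.clear();
        dst.limit(Math.min(dst.capacity(), src.length));
        int n = dst.remaining();
        for (int i = 0; i < n; i++) {
            dst.put(src[i]);
        }
        dst.flip();
        return n;
    }

    // Runs fill() untimed for warmupRuns, then returns the average
    // nanoseconds per call over timedRuns.
    static double averageNanos(FloatBuffer dst, float[] src, int warmupRuns, int timedRuns) {
        for (int i = 0; i < warmupRuns; i++) {
            fill(dst, src);
        }
        long start = System.nanoTime();
        for (int i = 0; i < timedRuns; i++) {
            fill(dst, src);
        }
        return (System.nanoTime() - start) / (double) timedRuns;
    }

    public static void main(String[] args) {
        float[] values = new float[40000];
        for (int i = 0; i < values.length; i++) {
            values[i] = i;
        }
        FloatBuffer buffer = ByteBuffer.allocateDirect(values.length * 4)
                .order(ByteOrder.nativeOrder())
                .asFloatBuffer();
        // Run counts are arbitrary; tune them for your machine.
        System.out.println("Average ns per fill: " + averageNanos(buffer, values, 2000, 10000));
    }
}
```

This is only a hand-rolled harness, so treat its numbers with the usual microbenchmark suspicion.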

Btw, I have developed a clone of the NIO buffer API that uses sun.misc.Unsafe, supports direct buffers only and allows disabling all bounds checks (à la org.lwjgl.util.NoChecks=true). For random access it’s several times faster than normal NIO and it’s just as fast for the easy cases. I’m using it with a private LWJGL build that has been modified to support it.

ah, Spasi’s secret LWJGL Ninja Edition :slight_smile:

Oooh, I would love to use that.

Good point on the smaller array size, I didn’t get to the point of averaging runs together so I was just increasing the array size to try to get more “accurate” data. Like I said, quick and dirty (especially crap like copy pasting System.nanoTime() a bunch instead of using a for loop). :slight_smile:

But in hindsight it’s definitely best to have an accurate benchmark, or you’re just wasting your time. Looks like lhkbob’s results are more in line with what we would expect.

I hacked up the code a bit more so that the timer gets “more warmed up” and everything is now called through functions. And it prints the time in ms, and there are smaller arrays and more iterations.

With a 50,000 item array and 50,000 iterations:


Generating array of 50000 values to use...
    Done.
Copying all values into a new array, then bulk putting (using existing array and FloatBuffer)...
    Took 0.21014882 milliseconds.
Copying all values into a new array, then bulk putting (using nonexisting array and FloatBuffer)...
    Took 0.42391662 milliseconds.
Using individual puts (using existing FloatBuffer)...
    Took 0.20898886 milliseconds.
Using individual puts (using nonexisting FloatBuffer)...
    Took 0.58338562 milliseconds.
Using index-based puts (using existing FloatBuffer)...
    Took 0.26990588 milliseconds.
Using index-based puts (using nonexisting FloatBuffer)...
    Took 0.6380724 milliseconds.

With 40,000 items (10,000 quads equivalent) and 1,000,000 iterations:


Generating array of 40000 values to use...
    Done.
Copying all values into a new array, then bulk putting (using existing array and FloatBuffer)...
    Took 0.119918364 milliseconds.
Copying all values into a new array, then bulk putting (using nonexisting array and FloatBuffer)...
    Took 0.254571916 milliseconds.
Using individual puts (using existing FloatBuffer)...
    Took 0.110950286 milliseconds.
Using individual puts (using nonexisting FloatBuffer)...
    Took 0.381005605 milliseconds.
Using index-based puts (using existing FloatBuffer)...
    Took 0.155216281 milliseconds.
Using index-based puts (using nonexisting FloatBuffer)...
    Took 0.412920993 milliseconds.

New code:


import java.nio.FloatBuffer;
import java.nio.ByteBuffer;

public class FloatBufferProfiler
{
	public static final int ARRAY_SIZE = 40000;
	public static final double NANO_TO_MILLI = 1000000.0;
	public static final int NUM_TESTS = 1000000;
	
	public static void main(String[] args)
	{
		float[] valueArray = generateValueArray();
		warmUpClock(10000);
		bulkPut(true, valueArray);
		bulkPut(false, valueArray);
		singlePut(true, valueArray);
		singlePut(false, valueArray);
		singlePutIndexed(true, valueArray);
		singlePutIndexed(false, valueArray);
	}
	
	private static float[] generateValueArray()
	{
		System.out.println("Generating array of " + ARRAY_SIZE + " values to use...");
		float[] valueArray = new float[ARRAY_SIZE];
		for (int i = 0; i < valueArray.length; i++)
		{
			valueArray[i] = (float) (Math.random() * 100000);
		}
		System.out.println("    Done.");
		return valueArray;
	}
	
	private static void warmUpClock(int iterations)
	{
		long timePoll = System.nanoTime();
		for (int i = 0; i < iterations; i++)
		{
			timePoll = System.nanoTime();
		}
	}
	
	private static void bulkPut(boolean existingBuffer, float[] valueArray)
	{
		System.out.println("Copying all values into a new array, then bulk putting (using " + (existingBuffer ? "existing" : "nonexisting") + " array and FloatBuffer)...");
		
		float[] newValues = null;
		FloatBuffer buffer = null;
		
		if (existingBuffer)
		{
			newValues = new float[ARRAY_SIZE];
			buffer = ByteBuffer.allocateDirect(ARRAY_SIZE * 4).asFloatBuffer();
		}
		
		long timePoll = System.nanoTime();
		
		for (int u = 0; u < NUM_TESTS; u++)
		{
			if (existingBuffer)
			{
				buffer.clear();
			}
			else
			{
				newValues = new float[ARRAY_SIZE];
				buffer = ByteBuffer.allocateDirect(ARRAY_SIZE * 4).asFloatBuffer();
			}
			
			for (int i = 0; i < valueArray.length; i++)
			{
				newValues[i] = valueArray[i];
			}
			buffer.put(newValues);
			buffer.flip();
		}
		
		System.out.println("    Took " + ((System.nanoTime() - timePoll) / (NANO_TO_MILLI * NUM_TESTS)) + " milliseconds.");
	}
	
	private static void singlePut(boolean existingBuffer, float[] valueArray)
	{
		System.out.println("Using individual puts (using " + (existingBuffer ? "existing" : "nonexisting") + " FloatBuffer)...");
		
		FloatBuffer buffer = null;
		
		if (existingBuffer)
		{
			buffer = ByteBuffer.allocateDirect(ARRAY_SIZE * 4).asFloatBuffer();
		}
		
		long timePoll = System.nanoTime();
		
		for (int u = 0; u < NUM_TESTS; u++)
		{
			if (existingBuffer)
			{
				buffer.clear();
			}
			else
			{
				buffer = ByteBuffer.allocateDirect(ARRAY_SIZE * 4).asFloatBuffer();
			}
			
			for (int i = 0; i < valueArray.length; i++)
			{
				buffer.put(valueArray[i]);
			}
			buffer.flip();
		}
		System.out.println("    Took " + ((System.nanoTime() - timePoll) / (NANO_TO_MILLI * NUM_TESTS)) + " milliseconds.");
	}
	
	private static void singlePutIndexed(boolean existingBuffer, float[] valueArray)
	{
		System.out.println("Using index-based puts (using " + (existingBuffer ? "existing" : "nonexisting") + " FloatBuffer)...");
		
		FloatBuffer buffer = null;
		
		if (existingBuffer)
		{
			buffer = ByteBuffer.allocateDirect(ARRAY_SIZE * 4).asFloatBuffer();
		}
		
		long timePoll = System.nanoTime();
		
		for (int u = 0 ; u < NUM_TESTS; u++)
		{
			if (existingBuffer)
			{
				buffer.clear();
			}
			else
			{
				buffer = ByteBuffer.allocateDirect(ARRAY_SIZE * 4).asFloatBuffer();
			}
			
			for (int i = 0; i < valueArray.length; i++)
			{
				buffer.put(i, valueArray[i]);
			}
			buffer.flip();
		}
		System.out.println("    Took " + ((System.nanoTime() - timePoll) / (NANO_TO_MILLI * NUM_TESTS)) + " milliseconds.");
	}
}

Here is a table of fasty-ness!


Using individual puts (using existing FloatBuffer)
    is the fastest!
Copying all values into a new array, then bulk putting (using existing array and FloatBuffer)
    takes 8% longer.
Using index-based puts (using existing FloatBuffer)
    takes 40% longer.
Copying all values into a new array, then bulk putting (using nonexisting array and FloatBuffer)
    takes 129% longer (2.29x the time).
Using individual puts (using nonexisting FloatBuffer)
    takes 243% longer (3.43x the time).
Using index-based puts (using nonexisting FloatBuffer)
    takes 272% longer (3.72x the time).

Interestingly enough, this means that just using a plain old put() is the fastest method (with this number of vertices). I’m not sure why index-based puts would be slower, but they appear to be. The difference between the bulk put and the single non-indexed puts is pretty minor, but it looks like you should definitely avoid index-based puts. Especially notable: absolutely keep your FloatBuffer in memory and just clear() it every time you want to put new stuff in it. In these runs that is between about 2 and 3.5 times faster than allocating a new buffer each time.

Here’s a run with only 5,000 values. Note that the array method (bulk put into an existing buffer) becomes the fastest in this case.


Generating array of 5000 values to use...
    Done.
Copying all values into a new array, then bulk putting (using existing array and FloatBuffer)...
    Took 0.013873423 milliseconds.
Copying all values into a new array, then bulk putting (using nonexisting array and FloatBuffer)...
    Took 0.028957902 milliseconds.
Using individual puts (using existing FloatBuffer)...
    Took 0.014773814 milliseconds.
Using individual puts (using nonexisting FloatBuffer)...
    Took 0.0437836 milliseconds.
Using index-based puts (using existing FloatBuffer)...
    Took 0.018534608 milliseconds.
Using index-based puts (using nonexisting FloatBuffer)...
    Took 0.047237627 milliseconds.

Another takeaway is that high-level graphics engines aren’t really required to use FloatBuffers in their interfaces. They can have geometry, etc. represented as plain old arrays and then keep a cached buffer around behind the scenes for when they need to talk to OpenGL. This is nice because you don’t have to worry about the user screwing up the direct-ness or endianness of the buffer anymore.
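
Something along these lines is what I have in mind. It is only a sketch: GeometryUploader and toNative are hypothetical names, the doubling growth policy is an arbitrary choice, and the actual OpenGL call is left out.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.FloatBuffer;

// Hypothetical engine-side helper: callers hand over plain float[] data,
// and the direct, native-order buffer stays an internal detail.
final class GeometryUploader {

    private FloatBuffer cache;

    GeometryUploader(int initialCapacityInFloats) {
        cache = newDirect(initialCapacityInFloats);
    }

    private static FloatBuffer newDirect(int floats) {
        return ByteBuffer.allocateDirect(floats * 4)
                .order(ByteOrder.nativeOrder())
                .asFloatBuffer();
    }

    // Copies the array into the cached buffer, reallocating only when the
    // data no longer fits. The returned buffer is flipped and ready to read.
    FloatBuffer toNative(float[] vertices) {
        if (vertices.length > cache.capacity()) {
            cache = newDirect(Math.max(vertices.length, cache.capacity() * 2));
        }
        cache.clear();
        cache.put(vertices);
        cache.flip();
        return cache;
    }

    public static void main(String[] args) {
        GeometryUploader uploader = new GeometryUploader(1024);
        FloatBuffer quad = uploader.toNative(new float[] {0f, 0f, 1f, 0f, 1f, 1f, 0f, 1f});
        System.out.println("direct = " + quad.isDirect() + ", floats = " + quad.remaining());
    }
}
```

The returned buffer is shared between calls, so it is only valid until the next toNative() call.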

Here are the updated timings from my Mac (10.6.5, 2.53 GHz i5 with Java 1.6_22):


Generating array of 40000 values to use...
    Done.
Copying all values into a new array, then bulk putting (using existing array and FloatBuffer)...
    Took 0.12576384 milliseconds.
Copying all values into a new array, then bulk putting (using nonexisting array and FloatBuffer)...
    Took 0.2772573 milliseconds.
Using individual puts (using existing FloatBuffer)...
    Took 0.14233947 milliseconds.
Using individual puts (using nonexisting FloatBuffer)...
    Took 0.41132515 milliseconds.
Using index-based puts (using existing FloatBuffer)...
    Took 0.16847472 milliseconds.
Using index-based puts (using nonexisting FloatBuffer)...
    Took 0.44021002 milliseconds.

and here is the same benchmark run on Ubuntu 8.04 Hardy (Intel Duo 3.16 GHz, Java 1.6_21):


Generating array of 40000 values to use...
    Done.
Copying all values into a new array, then bulk putting (using existing array and FloatBuffer)...
    Took 0.1619271662 milliseconds.
Copying all values into a new array, then bulk putting (using nonexisting array and FloatBuffer)...
    Took 0.29461064228 milliseconds.
Using individual puts (using existing FloatBuffer)...
    Took 0.12362605117 milliseconds.
Using individual puts (using nonexisting FloatBuffer)...
    Took 0.2110052842 milliseconds.
Using index-based puts (using existing FloatBuffer)...
    Took 0.16837437908 milliseconds.
Using index-based puts (using nonexisting FloatBuffer)...
    Took 0.24990385835 milliseconds.

Both of these used 40,000-element arrays/buffers and only 100,000 test runs because I got bored :confused: I don’t know why there was such a huge performance hit for individual and index puts with nonexisting buffers on the Mac. Either way, using existing arrays + bulk puts seems to be a viable option on both OSes.

Your benchmark code is flawed. You are not warming up the test code and you aren’t setting the native ByteOrder on the buffer you use. This is the modified code:

package org.lwjgl;

import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.FloatBuffer;
import java.util.Random;

public class FloatBufferProfiler {

	private static final int RANDOM_SEED = 1023;
	private static final int ARRAY_SIZE = 40000;
	private static final double NANO_TO_MILLI = 1000000.0;

	private static final int WARMUP_RUNS = 5;
	private static final int TEST_RUNS = 10;
	private static final int LOOPS_PER_RUN = 10000;

	public static void main(String[] args) {
		FloatBuffer buffer = generateBuffer();

		System.out.println("FloatBuffer implementation: " + buffer.getClass().getName());

		System.out.println("\n---------------------\n");

		System.out.println("Warming up...");
		System.out.println("\tClock warmed up: " + warmUpClock(10000));
		runTest(WARMUP_RUNS, true);
		System.out.println("\tDone.");

		System.out.println("\n---------------------\n");

		runTest(TEST_RUNS, false);
	}

	private static void runTest(final int runs, final boolean warmup) {
		float[] values = generateValues();
		float[] newValues = generateNewValues();
		FloatBuffer buffer = generateBuffer();

		long bulkPutOld = 0,
			singlePutOld = 0,
			singlePutIndexedOld = 0;

		/*
		long bulkPutNew = 0,
			singlePutNew = 0,
			singlePutIndexedNew = 0;
		*/

		for ( int i = 0; i < runs; i++ ) {
			bulkPutOld += bulkPut(true, values, newValues, buffer);
			//bulkPutNew += bulkPut(false, values, newValues, null);
			singlePutOld += singlePut(true, values, buffer);
			//singlePutNew += singlePut(false, values, null);
			singlePutIndexedOld += singlePutIndexed(true, values, buffer);
			//singlePutIndexedNew += singlePutIndexed(false, values, null);
		}

		if ( !warmup ) {
			System.out.println("Copying all values into a new array, then bulk putting (using existing array and FloatBuffer)...");
			printTime(bulkPutOld / runs);

			//System.out.println("Copying all values into a new array, then bulk putting (using nonexisting array and FloatBuffer)...");
			//printTime(bulkPutNew / runs);

			System.out.println("Using individual puts (using existing FloatBuffer)...");
			printTime(singlePutOld / runs);

			//System.out.println("Using individual puts (using nonexisting FloatBuffer)...");
			//printTime(singlePutNew / runs);

			System.out.println("Using index-based puts (using existing FloatBuffer)...");
			printTime(singlePutIndexedOld / runs);

			//System.out.println("Using index-based puts (using nonexisting FloatBuffer)...");
			//printTime(singlePutIndexedNew / runs);
		}
	}

	private static void printTime(long time) {
		System.out.println("\tTook " + Double.toString((time) / (NANO_TO_MILLI * LOOPS_PER_RUN)) + " milliseconds.");
	}

	private static float[] generateValues() {
		float[] valueArray = new float[ARRAY_SIZE];
		Random rand = new Random(RANDOM_SEED);
		for ( int i = 0; i < valueArray.length; i++ ) {
			valueArray[i] = rand.nextFloat() * 100000.0f;
		}
		return valueArray;
	}

	private static float[] generateNewValues() {
		return new float[ARRAY_SIZE];
	}

	private static FloatBuffer generateBuffer() {
		return ByteBuffer.allocateDirect(ARRAY_SIZE * 4).order(ByteOrder.nativeOrder()).asFloatBuffer();
	}

	private static long warmUpClock(int iterations) {
		long timePoll = System.nanoTime();
		for ( int i = 0; i < iterations; i++ ) {
			timePoll += System.nanoTime();
		}
		return timePoll;
	}

	private static long bulkPut(boolean existingBuffer, float[] values, float[] newValues, FloatBuffer buffer) {
		long timePoll = System.nanoTime();

		for ( int u = 0; u < LOOPS_PER_RUN; u++ ) {
			if ( !existingBuffer ) {
				newValues = generateNewValues();
				buffer = generateBuffer();
			}

			//System.arraycopy(values, 0, newValues, 0, values.length);
			for ( int i = 0; i < values.length; i++ )
				newValues[i] = values[i];

			buffer.put(newValues);
			buffer.flip();
		}

		return System.nanoTime() - timePoll;
	}

	private static long singlePut(boolean existingBuffer, float[] values, FloatBuffer buffer) {
		long timePoll = System.nanoTime();

		for ( int u = 0; u < LOOPS_PER_RUN; u++ ) {
			if ( !existingBuffer )
				buffer = generateBuffer();

			buffer.position(0);
			buffer.limit(values.length);
			for ( int i = 0; i < values.length; i++ )
				buffer.put(values[i]);

			buffer.flip();
		}

		return System.nanoTime() - timePoll;
	}

	private static long singlePutIndexed(boolean existingBuffer, float[] values, FloatBuffer buffer) {
		long timePoll = System.nanoTime();

		for ( int u = 0; u < LOOPS_PER_RUN; u++ ) {
			if ( !existingBuffer )
				buffer = generateBuffer();

			for ( int i = 0; i < values.length; i++ )
				buffer.put(i, values[i]);
		}

		return System.nanoTime() - timePoll;
	}

}

I have commented out the tests that recreate the buffer, since I found that the re-allocations take the majority of the time. Feel free to un-comment and test them if you want. In my tests, index-based puts were the fastest, followed by bulk puts, followed by individual puts. Results:

FloatBuffer implementation: java.nio.DirectFloatBufferU
---------------------
Copying all values into a new array, then bulk putting (using existing array and
 FloatBuffer)...
        Took 0.0421260103 milliseconds.
Using individual puts (using existing FloatBuffer)...
        Took 0.0820538328 milliseconds.
Using index-based puts (using existing FloatBuffer)...
        Took 0.0326686016 milliseconds.
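
On the byte order point: a freshly allocated ByteBuffer is always big-endian, whatever the platform, so a float view created without order(ByteOrder.nativeOrder()) has to swap bytes on every access on a little-endian machine. A small sketch to check what you are getting (NativeOrderSketch and its method names are made up; the concrete class names printed are JDK implementation details and may differ between versions):

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.FloatBuffer;

final class NativeOrderSketch {

    // What the earlier benchmarks in this thread allocate: big-endian by default.
    static FloatBuffer defaultOrder(int floats) {
        return ByteBuffer.allocateDirect(floats * 4).asFloatBuffer();
    }

    // Same allocation, but switched to the platform's byte order first.
    static FloatBuffer nativeOrder(int floats) {
        return ByteBuffer.allocateDirect(floats * 4)
                .order(ByteOrder.nativeOrder())
                .asFloatBuffer();
    }

    public static void main(String[] args) {
        FloatBuffer a = defaultOrder(16);
        FloatBuffer b = nativeOrder(16);
        System.out.println("platform order: " + ByteOrder.nativeOrder());
        System.out.println("default view:   " + a.order() + " (" + a.getClass().getName() + ")");
        System.out.println("native view:    " + b.order() + " (" + b.getClass().getName() + ")");
    }
}
```

The order has to be set on the ByteBuffer before calling asFloatBuffer(), because the view keeps the order that was in effect when it was created.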

An interesting thing I noticed was that if you do the following:

buffer.position(0);
buffer.limit(values.length);

before the individual put loop, you get a 20-25% speed-up. With those two lines un-commented:

Copying all values into a new array, then bulk putting (using existing array and
 FloatBuffer)...
        Took 0.0421146698 milliseconds.
Using individual puts (using existing FloatBuffer)...
        Took 0.0607214925 milliseconds.
Using index-based puts (using existing FloatBuffer)...
        Took 0.0327218355 milliseconds.

So basically you can easily get rid of the bounds checks, but it’s still slower than index-based puts because you’re updating the current buffer position on every put (see the package-private nextPutIndex() in java.nio.Buffer).
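
To make the position bookkeeping visible, here is a tiny sketch (PutPositionSketch and its method names are made up for illustration). A relative put(float) advances the position on every call, while an absolute put(int, float) leaves it alone. One side effect worth knowing: flip() after nothing but absolute puts sets the limit to 0, so the read window has to be set by hand.

```java
import java.nio.FloatBuffer;

final class PutPositionSketch {

    // Relative puts: every call bumps the buffer's position by one.
    static int positionAfterRelativePuts(FloatBuffer buffer, float[] values) {
        buffer.clear();
        for (float v : values) {
            buffer.put(v);
        }
        return buffer.position();
    }

    // Absolute puts: the position is never touched.
    static int positionAfterAbsolutePuts(FloatBuffer buffer, float[] values) {
        buffer.clear();
        for (int i = 0; i < values.length; i++) {
            buffer.put(i, values[i]);
        }
        return buffer.position();
    }

    public static void main(String[] args) {
        float[] values = {1f, 2f, 3f};
        FloatBuffer buffer = FloatBuffer.allocate(8);

        System.out.println("relative puts: position = " + positionAfterRelativePuts(buffer, values));
        System.out.println("absolute puts: position = " + positionAfterAbsolutePuts(buffer, values));

        // After absolute puts, set the readable window explicitly instead of flip().
        buffer.position(0);
        buffer.limit(values.length);
        System.out.println("readable floats = " + buffer.remaining());
    }
}
```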