BufferedImage backed by a ByteBuffer

hi

I have written a BufferedImage extension that uses a special DataBuffer implementation, which stores its data directly in a DirectByteBuffer. I need this to avoid duplicating the data in memory: the image data has to be sent to OpenGL, which can only be done through a ByteBuffer. To keep the image updatable/redrawable, I need the linkage between the BufferedImage and the ByteBuffer. Therefore I need this solution.

I currently have a working solution, but it is incredibly slow. This led me to this nice section of these boards ;).

I have implemented it by creating a WritableRaster extension that takes a PixelInterleavedSampleModel (instantiated just as BufferedImage does) and my DirectDataBufferByte. The documentation says that ByteInterleavedRaster is used by BufferedImage for byte data to improve performance. I would like to use it too, but it doesn't accept my DirectDataBufferByte (the one backed by a ByteBuffer). Unfortunately, the source of ByteInterleavedRaster and its parent classes doesn't seem to be available, so I cannot look up what they do to improve performance.
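To make it concrete, the core idea of my ByteBuffer-backed DataBuffer is roughly the following (a simplified sketch with an illustrative class name, not the actual code from the repository):

```java
import java.awt.image.DataBuffer;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// A DataBuffer whose storage is a direct ByteBuffer, so the same memory
// can later be handed to OpenGL without copying.
public class ByteBufferDataBuffer extends DataBuffer {

    private final ByteBuffer buffer;

    public ByteBufferDataBuffer(int size) {
        super(DataBuffer.TYPE_BYTE, size);
        this.buffer = ByteBuffer.allocateDirect(size).order(ByteOrder.nativeOrder());
    }

    // The linkage to OpenGL: the raster and the GL upload share this buffer.
    public ByteBuffer getByteBuffer() {
        return buffer;
    }

    @Override
    public int getElem(int bank, int i) {
        return buffer.get(i) & 0xFF; // DataBuffer elements are unsigned
    }

    @Override
    public void setElem(int bank, int i, int val) {
        buffer.put(i, (byte) val);
    }
}
```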

The only difference between my DirectBufferedImage and a regular BufferedImage (with bytes) is the WritableRaster used. But the original BufferedImage is 3x as fast as my DirectBufferedImage, and writing the data to a byte array versus a ByteBuffer doesn't make any difference in performance. So I guess I only need a better WritableRaster implementation.

I hope I was clear enough and that someone can help me out here. If there are any further questions, please don't hesitate to ask.

Thanks in advance,

Marvin

First, make a tiny test case where the only thing done is writing to the BufferedImage through your DataBufferByte impl.

It should be command-line, and really do nothing else.

Why am I stressing this? If any code in the JVM creates a HeapByteBuffer via ByteBuffer.allocate(…), the performance of direct buffers can drop by a factor of 10, seriously: the JIT has to deoptimize from extremely fast pointer access to a jump table over subclasses, now that there is more than one subclass of ByteBuffer.
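Such a minimal command-line test could look roughly like this (just a sketch; the buffer size and round count are arbitrary, and note that only allocateDirect is ever called, so the put() call sites stay monomorphic):

```java
import java.nio.ByteBuffer;

// Standalone micro-test: writes only to a direct ByteBuffer and never
// calls ByteBuffer.allocate(), so no HeapByteBuffer subclass is loaded.
public class DirectBufferWriteTest {

    static long fill(ByteBuffer bb, int rounds) {
        long t0 = System.nanoTime();
        for (int r = 0; r < rounds; r++) {
            for (int i = 0; i < bb.capacity(); i++) {
                bb.put(i, (byte) i);
            }
        }
        return System.nanoTime() - t0;
    }

    public static void main(String[] args) {
        ByteBuffer direct = ByteBuffer.allocateDirect(196608);
        fill(direct, 50); // warm up the JIT first
        System.out.println("direct: " + fill(direct, 500) / 1000000L + " ms");
    }
}
```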

Further, the Client VM is rather poor at accessing direct buffers. Switch to the Server VM, which can speed up buffer access by a factor of 3 or so.

You might want to post your findings… :wink:

If it were a "normal" method call, it would go from monomorphic to bimorphic, which is something like if/else plus uncommon-trap handling.

However, this stuff is completely intrinsified (at least in the Server VM), so I guess it will not be treated like a normal method.

Regards, Clemens

Yeah, well, try and see.

I’ve seen horrific performance degradation simply because I put a ByteBuffer.allocate(1) as the first line in my program.

Hey. Thanks a lot for your replies, guys.

I do have a test case like this. Actually, I tested it like this before I posted my request. I even made a test case where I filled a ByteBuffer with random values and, for comparison, a byte array. Both the ByteBuffer and the array were of size 196608, and I filled them 500 times. The ByteBuffer version was about 4% slower. So I guess everything is right with the "directness" of the ByteBuffer.

Then I tried something else. I created a BufferedImage backed by my own instance of DataBufferByte, where I passed in my own byte array. I created it the exact same way as the original BufferedImage constructor does, and it is exactly as fast as a standard BufferedImage. But if I use band offsets that differ from (2, 1, 0) (I would need (0, 1, 2) for OpenGL texture data), it is as slow as the ByteBuffer version.

So it seems like the BufferedImage doesn't use its best-optimized code path if anything differs from the way it would be created by default.
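For illustration, the by-hand construction I described looks roughly like this (a simplified sketch; the factory name is made up, and the actual classes are in the links below):

```java
import java.awt.color.ColorSpace;
import java.awt.image.*;

// Building a 3-byte BufferedImage by hand, with OpenGL-friendly band
// offsets {0, 1, 2} instead of the {2, 1, 0} that TYPE_3BYTE_BGR uses.
public class OffsetImageFactory {

    public static BufferedImage createRGBImage(int width, int height) {
        int[] bandOffsets = { 0, 1, 2 }; // R, G, B in memory order
        WritableRaster raster = Raster.createInterleavedRaster(
                DataBuffer.TYPE_BYTE, width, height,
                width * 3,      // scanline stride
                3,              // pixel stride
                bandOffsets, null);
        ColorSpace cs = ColorSpace.getInstance(ColorSpace.CS_sRGB);
        ColorModel cm = new ComponentColorModel(cs, false, false,
                ColorModel.OPAQUE, DataBuffer.TYPE_BYTE);
        return new BufferedImage(cm, raster, false, null);
    }
}
```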

Here is my testcase:
http://jagatoo.svn.sourceforge.net/viewvc/jagatoo/trunk/test/src/org/jagatoo/test/util/image/DirectBufferedImageTest.java?view=markup
http://jagatoo.svn.sourceforge.net/viewvc/jagatoo/trunk/src/org/jagatoo/image/

Any idea how to make the BufferedImage use its optimized code path for ByteBuffers or a (0, 1, 2, 3)-offset byte array, too?

Marvin

[quote]Any idea how to make the BufferedImage use its optimized code path for ByteBuffers or a (0, 1, 2, 3)-offset byte array, too?
[/quote]
Well, I wouldn't know, as BufferedImage is like a black box to me.

Do I understand correctly that a backing byte[] with swapped offsets would also be acceptable to you (no ByteBuffer at all)? In that case, the byte[] has a really big chance of being copied in the JNI calls that read from the byte[] and send it to the gfx driver. That overhead would be comparable to simply copying (and swapping) the bytes in Java code. So you can just as well render with the "wrong" offsets at full acceleration, then copy the bytes into a DirectByteBuffer and swap them around.

You might want to measure the overhead of that; it might be much less than the factor-3 slowdown you're seeing now.

// try to get rid of "p++" as it 'prevents' out-of-order execution in the CPU
bb.put(p+0, arr[p+3]);
bb.put(p+1, arr[p+2]);
bb.put(p+2, arr[p+1]);
bb.put(p+3, arr[p+0]);
p+=4;

Yes, actually this is pretty much my current solution ;). I am using the SharedBufferedImage (with the "wrong" byte offsets) that you can find in the link above. These SharedBufferedImages are used rather rarely at the moment, so this is not a real problem anyway. The point is just that I would prefer the "perfect" solution, if it were possible :).

Could you please explain that in a little more detail? What is out-of-order execution?

Marvin

You might want to read wikipedia about out-of-order execution.

It’s basically like this:
Simple mathematical instructions take only one clock cycle, while fetching a byte from memory can take up to a few dozen clock cycles. This is why the CPU will change the order in which instructions are executed (when the results would be the same (valid) as with in-order execution). You can change the execution order of this example without a problem:
x = a + b;
y = b - a;

You can't, however, change the execution order of these instructions without changing the outcome:
a = a + b;
y = b - a;

Let’s say you have this code:


for(int p=0; p<len; p++)
{
   int q = p+3;
   bb.put(p++, arr[q--]);
   bb.put(p++, arr[q--]);
   bb.put(p++, arr[q--]);
   bb.put(p, arr[q]);
}

There the effect of the 2nd line is dependent on the first line, so it can't be executed out of order. In C/C++ we have fancy compilers that optimize this away, but in Java I often* see a nice performance boost when turning the code into:


for(int p=0; p<len; p += 4)
{
   bb.put(p+0, arr[p+3]);
   bb.put(p+1, arr[p+2]);
   bb.put(p+2, arr[p+1]);
   bb.put(p+3, arr[p+0]);
}

Where all lines can be executed in any order, allowing the CPU to perform the operation in optimal order, depending on memory latency and whether the data is in cache or not.

* the JIT in the HotSpot VM is not always predictable, so your loop might suddenly be twice as fast, or you might not see that much of a difference.

Edit:
Further, the VM seems to benefit (10-20%) from manual loop-unrolling.


//if((len % 8) != 0)
if((len & 7) != 0)
  throw new IllegalStateException();
for(int p=0; p<len; p += 8)
{
   bb.put(p+0, arr[p+3]);
   bb.put(p+1, arr[p+2]);
   bb.put(p+2, arr[p+1]);
   bb.put(p+3, arr[p+0]);

   bb.put(p+4, arr[p+7]);
   bb.put(p+5, arr[p+6]);
   bb.put(p+6, arr[p+5]);
   bb.put(p+7, arr[p+4]);
}
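If you want to convince yourself that the reordered variants really produce the same bytes as the dependent loop, here is a self-contained check (the class and method names are just for the example):

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

// Sanity check: the dependent and independent swap loops must produce
// identical output, otherwise the reordering would not be valid.
public class SwapLoopTest {

    // Dependent version: every statement waits on the previous p++ / q--.
    static byte[] swapDependent(byte[] arr) {
        ByteBuffer bb = ByteBuffer.allocateDirect(arr.length);
        for (int p = 0; p < arr.length; p++) {
            int q = p + 3;
            bb.put(p++, arr[q--]);
            bb.put(p++, arr[q--]);
            bb.put(p++, arr[q--]);
            bb.put(p, arr[q]);
        }
        return drain(bb);
    }

    // Independent version: the four puts share no data dependencies.
    static byte[] swapIndependent(byte[] arr) {
        ByteBuffer bb = ByteBuffer.allocateDirect(arr.length);
        for (int p = 0; p < arr.length; p += 4) {
            bb.put(p + 0, arr[p + 3]);
            bb.put(p + 1, arr[p + 2]);
            bb.put(p + 2, arr[p + 1]);
            bb.put(p + 3, arr[p + 0]);
        }
        return drain(bb);
    }

    static byte[] drain(ByteBuffer bb) {
        byte[] out = new byte[bb.capacity()];
        bb.get(out); // absolute puts left the position at 0
        return out;
    }

    public static void main(String[] args) {
        byte[] src = { 1, 2, 3, 4, 5, 6, 7, 8 };
        System.out.println(Arrays.equals(swapDependent(src), swapIndependent(src))); // prints "true"
    }
}
```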

Thanks. This is interesting stuff.

Marvin