Comments:
- About 2D arrays vs 1D arrays:
1D arrays are faster for several reasons.
a) They are guaranteed to be contiguous. This means that they are much more likely to work well with your processor’s cache prefetch mechanism. In order to take the best advantage of this, you should process the pixel data in the order it is stored in memory. The importance of coding to the cache can’t be overstated.
b) Accessing a 1D array costs only one bounds check. In your case, you’re paying six (3*2) bounds checks each time you pull out a pixel, when you should be paying for at most one, and for none if the loop is written so the optimizer can eliminate the check.
c) A 2D array in Java is an array of arrays, so allocating it takes one allocation per row, i.e. a number of allocations linear in the row count. A 1D array is a single allocation.
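To make points a) to c) concrete, here is a minimal sketch of a flat pixel buffer. The class name FlatImage and its helpers are made up for illustration (they are not from your code); the only assumption is that pixels are stored row by row, so pixel (x, y) lives at index y * width + x.

```java
final class FlatImage {
    // Hypothetical helper: maps (x, y) to an index in a row-major 1D array.
    static int indexOf(int x, int y, int width) {
        return y * width + x;
    }

    // Walks the array in the order it is laid out in memory,
    // which is the access pattern the cache prefetcher handles best.
    static long sumPixels(int[] pixels) {
        long sum = 0;
        for (int i = 0; i < pixels.length; ++i) {
            sum += pixels[i];
        }
        return sum;
    }

    public static void main(String[] args) {
        int width = 4, height = 3;
        int[] pixels = new int[width * height]; // one allocation for the whole image
        for (int y = 0; y < height; ++y) {
            for (int x = 0; x < width; ++x) {
                pixels[indexOf(x, y, width)] = x + y;
            }
        }
        System.out.println(sumPixels(pixels)); // prints 30
    }
}
```

Only use indexOf when you really need random access by coordinate; for whole-frame passes, the plain linear loop in sumPixels is the shape to aim for.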
- How to write optimizer-friendly code for array bounds checks:
Write your for loops so that they process array elements linearly, i.e.:
for ( int i = 0; i < myArray.length; ++i ) {
// array indexing in here…
}
Don’t play with the index in the loop. Do the actual array indexing in the same function that does the looping. If you must put the actual array indexing in a different function, make the function small enough that HotSpot will inline it.
-
You’re not setting an initial size for ByteArrayOutputStream, so it’s growing needlessly. In fact, you should probably get rid of ByteArrayOutputStream altogether and just allocate an array that you can guarantee is large enough for a compressed frame. This will get rid of a lot of needless copying and processing.
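Something along these lines is what I have in mind. This is only a sketch: FrameBuffer and maxCompressedSize are hypothetical names, and the worst-case size of a compressed frame is something your codec has to supply. If you would rather keep the stream for now, the ByteArrayOutputStream(int size) constructor at least avoids the repeated growth.

```java
final class FrameBuffer {
    private final byte[] data; // allocated once, reused for every frame
    private int length;        // number of valid bytes currently in data

    // maxCompressedSize is an assumed upper bound on one compressed frame.
    FrameBuffer(int maxCompressedSize) {
        data = new byte[maxCompressedSize];
    }

    // Start a new frame without reallocating or clearing the array.
    void reset() {
        length = 0;
    }

    // Append the low 8 bits of b.
    void put(int b) {
        data[length++] = (byte) b;
    }

    int length() {
        return length;
    }

    // Exposes the backing array directly; only the first length() bytes are valid.
    byte[] array() {
        return data;
    }
}
```

Create one FrameBuffer up front, call reset() at the start of each frame, and hand array() plus length() to whatever consumes the compressed data, so nothing is copied along the way.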
-
In BitOutputStream.write() you’re doing a for loop. Practically anywhere you’re performing a loop like that, you should go ahead and perform loop unrolling. You’ll have to play around with the amount of unrolling to see what works best.
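I don’t have the body of BitOutputStream.write() in front of me, so the following is a stand-in rather than your actual loop: a hypothetical routine that packs eight 0/1 values into a byte, first as a loop and then fully unrolled. BitPack, packRolled and packUnrolled are names I made up for the illustration.

```java
final class BitPack {
    // Rolled version: one loop iteration (and one loop-condition test) per bit.
    static byte packRolled(int[] bits, int offset) {
        int b = 0;
        for (int i = 0; i < 8; ++i) {
            b = (b << 1) | (bits[offset + i] & 1);
        }
        return (byte) b;
    }

    // Unrolled version: same result, no loop counter and no branches.
    static byte packUnrolled(int[] bits, int offset) {
        int b = ((bits[offset]     & 1) << 7)
              | ((bits[offset + 1] & 1) << 6)
              | ((bits[offset + 2] & 1) << 5)
              | ((bits[offset + 3] & 1) << 4)
              | ((bits[offset + 4] & 1) << 3)
              | ((bits[offset + 5] & 1) << 2)
              | ((bits[offset + 6] & 1) << 1)
              |  (bits[offset + 7] & 1);
        return (byte) b;
    }
}
```

Full unrolling like this only works when the trip count is fixed. For a variable-length loop you would unroll by some factor (2, 4, 8) and keep a short cleanup loop for the remainder, then measure which factor actually wins.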
-
In general, you want to be careful about making function calls. I know Jeff told you that function calls in Java are just as fast as those in C++, but even in C++ you try to avoid making millions of function calls per second (e.g. a function call per pixel in a 300K pixel image). HotSpot may or may not inline the function, but it’s less likely to do so if the function is large (because, in general, it’s bad policy: you end up overflowing the instruction cache much more often).
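Here is a small sketch of the difference, using a made-up per-pixel operation (inverting the RGB bits); PixelOps, invert and invertAll are hypothetical names, not anything from your codec.

```java
final class PixelOps {
    // Per-pixel entry point: a caller pays one call per pixel
    // unless HotSpot decides to inline it.
    static int invert(int argb) {
        return argb ^ 0x00FFFFFF; // flip the RGB bits, leave alpha alone
    }

    // Per-frame entry point: one call for the whole image, with the
    // work done directly inside the loop.
    static void invertAll(int[] pixels) {
        for (int i = 0; i < pixels.length; ++i) {
            pixels[i] ^= 0x00FFFFFF;
        }
    }
}
```

A method as tiny as invert is a good inlining candidate, so the gap may well disappear in practice; the per-frame form just doesn’t depend on that decision going your way.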
-
On comparing Java codec performance vs C++ codec performance:
A well-written C++ codec is going to get most of its performance gains by managing the processor’s cache very carefully and exploiting parallelism through loop unrolling and SIMD instructions. It may also be able to avoid some array bounds checks that you can’t avoid in Java.
God bless,
-Toby Reyelts