Fastest texture generating algorithm

This algorithm will produce a texture, given an image.

The result ends up both in texImage and in a texture-compatible buffer (textureCompatibleBuffer), which can be passed to glTexSubImage2D, glTexImage2D or gluBuild2DMipmaps.

Basically, for any given image I, it generates an RGBA image Y (four bytes per pixel), which is a copy of I scaled up to the next power of two in each dimension. In my tests it is about 16 times faster than a comparable graphics.drawImage() scaling operation, even with all the “dd” performance flags set on the VM. It uses fixed-point arithmetic to achieve maximum speed.
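
To illustrate the fixed-point idea on its own, here is a minimal sketch that stretches a single row of samples. It assumes a 16.16 format (16 integer bits, 16 fractional bits), which is what the shifts by 16 and the factor 65536 in the code further down suggest. The class and method names are made up for this example.

```java
// Hypothetical helper for illustration only, not part of the original code.
final class FixedPointScaleSketch {

    // Stretches one row of single-byte samples to dstLength samples
    // using nearest-neighbour sampling and 16.16 fixed-point stepping.
    static byte[] scaleRow(byte[] src, int dstLength) {
        // source samples per destination sample, scaled by 2^16
        int step = (int) (((double) src.length / (double) dstLength) * 65536);
        byte[] dst = new byte[dstLength];
        int pos = 0; // 16.16 fixed-point position in the source row
        for (int x = 0; x < dstLength; x++) {
            dst[x] = src[pos >> 16]; // drop the 16 fractional bits
            pos += step;             // one integer add per sample, no division
        }
        return dst;
    }
}
```

The inner loop needs only a shift and an add per sample, which is presumably where most of the speed comes from.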

I needed this because I was rendering video to a texture, and the Java library algorithms were too slow.

In case you are wondering what the fastest way to update a texture in GL is: it is glTexSubImage2D, and you can use it together with this algorithm like so:


gl.glTexSubImage2D(GL.GL_TEXTURE_2D, 0, 0, 0, textureWidth, textureHeight, GL.GL_RGBA, GL.GL_UNSIGNED_BYTE, textureCompatibleBuffer);

Anyway, I hope it is useful to you, and if you find other optimizations I haven’t thought of, please share them with us here 8)

You will need the “get2Fold” algorithm, as described in this post:
http://www.java-gaming.org/cgi-bin/JGNetForums/YaBB.cgi?board=share;action=display;num=1117186907
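
In case the link is unavailable, here is a minimal sketch of what such a function could look like. It assumes get2Fold returns the smallest power of two that is not less than its argument; the linked post may implement it differently.

```java
// Assumed behaviour, not copied from the linked post.
final class PowerOfTwoSketch {

    // Returns the smallest power of two that is >= fold (for fold >= 1).
    static int get2Fold(int fold) {
        int ret = 1;
        while (ret < fold) {
            ret *= 2;
        }
        return ret;
    }
}
```

With this version a 640x480 image ends up in a 1024x512 texture.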


// variables used

private BufferedImage       image;
private WritableRaster       raster;
private BufferedImage       texImage;
private byte[]                  iBuffer;
private byte[]                  tBuffer;
private ByteBuffer            textureCompatibleBuffer;
private int[]                   bankOffsets;
private int                        textureWidth = -1;
private int                        textureHeight = -1;
private int                        xScaleUnit;
private int                        yScaleUnit;
private ComponentSampleModel imageSampleModel;
private int                   scanLineStride;
private int                        pixelStride;

public static final ColorModel glAlphaColorModel = 
      new ComponentColorModel(ColorSpace.getInstance(ColorSpace.CS_sRGB),
        new int[] {8,8,8,8},
        true,
        false,
        ComponentColorModel.TRANSLUCENT,
        DataBuffer.TYPE_BYTE);

// start code

// only generate the raster and the buffer the first time
if(raster == null) { 
      // generate a texture compatible image size, large enough to hold the image
      textureWidth = get2Fold(image.getWidth());
      textureHeight = get2Fold(image.getHeight());
      
      // generate an interleaved byte-based ARGB raster and image
      raster = Raster.createInterleavedRaster(DataBuffer.TYPE_BYTE,this.textureWidth,this.textureHeight,4,null);
      texImage = new BufferedImage(glAlphaColorModel,raster,false,new Hashtable());
   
      // get references to the backing arrays of the [t]exture buffer
      // and of the [i]mage buffer
      iBuffer = ((DataBufferByte) image.getRaster().getDataBuffer()).getData(); 
      tBuffer= ((DataBufferByte) texImage.getRaster().getDataBuffer()).getData(); 
      
      // get information on how the image is stored in the buffer
      imageSampleModel = (ComponentSampleModel) image.getSampleModel();
      scanLineStride = imageSampleModel.getScanlineStride();
      pixelStride = imageSampleModel.getPixelStride();
      bankOffsets = imageSampleModel.getBandOffsets();
      
      // generate 16.16 fixed-point scale factors, which let us calculate,
      // for a given x or y on the texture, where the source pixel
      // on the image is
      xScaleUnit = (int) (((double) image.getWidth() / (double) textureWidth) * 65536);
      yScaleUnit = (int) (((double) image.getHeight() / (double) textureHeight) * 65536);
            
      // generate a bytebuffer to store the resulting image
      textureCompatibleBuffer = ByteBuffer.allocateDirect(tBuffer.length); 
      textureCompatibleBuffer.order(ByteOrder.nativeOrder()); 
}      


int adr = 0; // the address in the texture buffer
int xOffset = 0; // the x coordinate in the image
int bufferOffset = 0; // the final buffer offset in the image
int yOffset = 0; // the y coordinate in the image
int yBufferOffset = 0; // a temp value containing the start of the scanline in the image
if(bankOffsets.length > 3) { // RGBA or ABGR images
      for(int y=0; y<textureHeight; y++) {
            xOffset = 0;

            yBufferOffset = (yOffset >> 16) * scanLineStride;
            
            for(int x=0; x<textureWidth; x++) {                  
                  bufferOffset = yBufferOffset  + (xOffset >> 16) * pixelStride;

                  tBuffer[adr++] = iBuffer[bufferOffset + bankOffsets[0]];
                  tBuffer[adr++] = iBuffer[bufferOffset + bankOffsets[1]];
                  tBuffer[adr++] = iBuffer[bufferOffset + bankOffsets[2]];
                  tBuffer[adr++] = iBuffer[bufferOffset + bankOffsets[3]];
      
                  xOffset += xScaleUnit;
            }
            yOffset += yScaleUnit;
      }
} else { // RGB or BGR images
      for(int y=0; y<textureHeight; y++) {
            xOffset = 0;

            yBufferOffset = (yOffset >> 16) * scanLineStride;
            
            for(int x=0; x<textureWidth; x++) {                  
                  bufferOffset = yBufferOffset  + (xOffset >> 16) * pixelStride;

                  tBuffer[adr++] = iBuffer[bufferOffset + bankOffsets[0]];
                  tBuffer[adr++] = iBuffer[bufferOffset + bankOffsets[1]];
                  tBuffer[adr++] = iBuffer[bufferOffset + bankOffsets[2]];
                  tBuffer[adr++] = -1; // -1 signed = 255 unsigned
                  
                  xOffset += xScaleUnit;
            }
            yOffset += yScaleUnit;
      }
}
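// copy the result into the direct buffer, then rewind it so that
// GL reads from position 0 (put() leaves the position at the end)
textureCompatibleBuffer.rewind();
textureCompatibleBuffer.put(tBuffer, 0, tBuffer.length);
textureCompatibleBuffer.rewind();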
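
One thing to watch for: the casts to DataBufferByte and ComponentSampleModel above only work for byte-backed images such as TYPE_3BYTE_BGR or TYPE_4BYTE_ABGR. An int-packed image such as TYPE_INT_ARGB would throw a ClassCastException. Below is a small sketch of a check you could run first; the class and method names are made up for this example.

```java
import java.awt.image.BufferedImage;
import java.awt.image.ComponentSampleModel;
import java.awt.image.DataBufferByte;

// Hypothetical helper for illustration only.
final class BandOffsetSketch {

    // True if the image is stored as component bytes, which is what the
    // casts to DataBufferByte and ComponentSampleModel require.
    static boolean isByteBacked(BufferedImage img) {
        return img.getRaster().getDataBuffer() instanceof DataBufferByte
                && img.getSampleModel() instanceof ComponentSampleModel;
    }

    // Byte offset of each band (R, G, B and optionally A) within a pixel.
    // Only valid when isByteBacked(img) is true.
    static int[] bandOffsets(BufferedImage img) {
        return ((ComponentSampleModel) img.getSampleModel()).getBandOffsets();
    }
}
```

For TYPE_3BYTE_BGR the band offsets are {2, 1, 0} and for TYPE_4BYTE_ABGR they are {3, 2, 1, 0}, which is why copying bands 0 to 3 in order yields R, G, B, A bytes regardless of how the source stores them.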

Thanks for the code! Nice work.

However, if you’re looking for performance it might be better to write specific functions and take some of the branches out. The RGB vs RGBA branch for instance.

Kev

Thank you for your input.

I separated the initialization and took out the (if/then ARGB) branches, but the speed increase is barely noticeable. The results of my test program:

5000x

with branches
run 1
19578 milliseconds
run 2
19953 milliseconds

without branches
run 1
19531 milliseconds
run 2
19640 milliseconds

If you want to increase your performance bigtime:

Compare these three ways of doing the same thing:

A (your array-filler)

             srcIndex = 0;
            dstIndex = 0;

            for (int i = 0; i < count; i++)
            {
               dst[dstIndex++] = src[srcIndex++];
               dst[dstIndex++] = src[srcIndex++];
               dst[dstIndex++] = src[srcIndex++];
               dst[dstIndex++] = src[srcIndex++];
            }

B (slightly adjusted, avoiding integer-increments)

             srcIndex = 0;
            dstIndex = 0;

            for (int i = 0; i < count; i++)
            {
               dst[dstIndex] = src[srcIndex];
               dst[dstIndex + 1] = src[srcIndex + 1];
               dst[dstIndex + 2] = src[srcIndex + 2];
               dst[dstIndex + 3] = src[srcIndex + 3];
               dstIndex += 4;
               srcIndex += 4;
            }

C (as B, but unrolled loop)

             srcIndex = 0;
            dstIndex = 0;

            for (int i = 0; i < count; i+=4)
            {
               dst[dstIndex] = src[srcIndex];
               dst[dstIndex + 1] = src[srcIndex + 1];
               dst[dstIndex + 2] = src[srcIndex + 2];
               dst[dstIndex + 3] = src[srcIndex + 3];
               dst[dstIndex + 4] = src[srcIndex + 4];
               dst[dstIndex + 5] = src[srcIndex + 5];
               dst[dstIndex + 6] = src[srcIndex + 6];
               dst[dstIndex + 7] = src[srcIndex + 7];
               dst[dstIndex + 8] = src[srcIndex + 8];
               dst[dstIndex + 9] = src[srcIndex + 9];
               dst[dstIndex + 10] = src[srcIndex + 10];
               dst[dstIndex + 11] = src[srcIndex + 11];
               dst[dstIndex + 12] = src[srcIndex + 12];
               dst[dstIndex + 13] = src[srcIndex + 13];
               dst[dstIndex + 14] = src[srcIndex + 14];
               dst[dstIndex + 15] = src[srcIndex + 15];
               dstIndex += 16;
               srcIndex += 16;
            }

Benchmark (Client VM)
A: 1978.8ms
B: 1179.7ms
C: 0973.7ms

Benchmark (Server VM)
A: 1525.5ms
B: 0918.3ms
C: 0718.7ms

Interesting huh? :o ;D

Seems like even the ServerVM cannot optimize it to the same (native) code.

Option C is a bit messy (and risky!), but the difference between A and B is huge, and doesn’t take a lot of effort to implement.
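
If you want to repeat the A versus B comparison yourself, a minimal timing sketch could look like the one below. The buffer size, iteration counts and names are arbitrary choices for this example, and a naive loop like this is easily distorted by JIT warm-up, so treat the numbers as rough; a proper harness such as JMH would give more trustworthy results.

```java
// Illustrative micro-benchmark sketch; sizes and iteration counts are arbitrary.
final class CopyLoopSketch {

    // Variant A: post-increment on every array access.
    static void copyA(byte[] src, byte[] dst, int count) {
        int srcIndex = 0;
        int dstIndex = 0;
        for (int i = 0; i < count; i++) {
            dst[dstIndex++] = src[srcIndex++];
            dst[dstIndex++] = src[srcIndex++];
            dst[dstIndex++] = src[srcIndex++];
            dst[dstIndex++] = src[srcIndex++];
        }
    }

    // Variant B: constant offsets, one index update per pixel.
    static void copyB(byte[] src, byte[] dst, int count) {
        int srcIndex = 0;
        int dstIndex = 0;
        for (int i = 0; i < count; i++) {
            dst[dstIndex] = src[srcIndex];
            dst[dstIndex + 1] = src[srcIndex + 1];
            dst[dstIndex + 2] = src[srcIndex + 2];
            dst[dstIndex + 3] = src[srcIndex + 3];
            dstIndex += 4;
            srcIndex += 4;
        }
    }

    // Runs the task the given number of times and returns elapsed nanoseconds.
    static long timeNanos(Runnable task, int iterations) {
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            task.run();
        }
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        final int count = 256 * 256; // pixels, 4 bytes each
        final byte[] src = new byte[count * 4];
        final byte[] dst = new byte[count * 4];
        for (int i = 0; i < src.length; i++) {
            src[i] = (byte) i;
        }
        // warm up both variants so the JIT has compiled them before timing
        timeNanos(() -> copyA(src, dst, count), 500);
        timeNanos(() -> copyB(src, dst, count), 500);
        long a = timeNanos(() -> copyA(src, dst, count), 2000);
        long b = timeNanos(() -> copyB(src, dst, count), 2000);
        System.out.println("A: " + a / 1_000_000.0 + " ms");
        System.out.println("B: " + b / 1_000_000.0 + " ms");
    }
}
```

On a newer VM the gap may be much smaller or gone entirely, so measure on the VM you actually ship with.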

[quote]If you want to increase your performance bigtime:

Compare these three ways of doing the same thing:

Benchmark
A: 1978.8ms
B: 1179.7ms
C: 0973.7ms

Interesting huh? :o ;D
[/quote]
Very!

I also experimented with creating a lookup array of integers, basically pre-calculating the offsets in buffer b. However, this method was somehow SLOWER than calculating the offsets on the fly.
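
A sketch of what such a lookup table might look like, assuming the same 16.16 fixed-point step as in the scaling code; the class and method names are illustrative only.

```java
// Illustrative sketch of a precomputed offset table.
final class OffsetTableSketch {

    // Precomputes, for every destination x, the byte offset of the
    // source pixel within one scanline.
    static int[] buildXOffsets(int srcWidth, int dstWidth, int pixelStride) {
        int step = (int) (((double) srcWidth / (double) dstWidth) * 65536);
        int[] offsets = new int[dstWidth];
        int pos = 0; // 16.16 fixed-point position in the source row
        for (int x = 0; x < dstWidth; x++) {
            offsets[x] = (pos >> 16) * pixelStride;
            pos += step;
        }
        return offsets;
    }
}
```

The inner loop would then read offsets[x] instead of computing (xOffset >> 16) * pixelStride. One plausible reason for the slowdown is that the table costs an extra memory load per pixel, while a shift and a multiply are very cheap, but that is a guess.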

I’ll try and tweak the copy loop based on the examples you have given me.

Could you explain why using “src++” (example A) is slower than using “src+0, src+1, src+2, src+3” followed by “src += 4” (example B)? Very odd.

I think… (that’s a big fat disclaimer… ;D)

index++
index++
index++
index++

performs 4 assignments and native increments on [index]; these might be expensive, and they cannot be executed in parallel because each one depends on the result of the previous one (does a non-HT P4 do anything in parallel anyway?)

whereas

index + 0
index + 1
index + 2
index + 3

index +=4;

has only 1 assignment, and the four accesses could be done in parallel (again, I have no clue whether or not things like this can be done in parallel by my non-HT Intel CPU)

Very interesting indeed. I did something similar but used short buffers (16-bit colour) so I could write the full colour info with one instruction. I have also done this with integer buffers for ARGB. I don’t have the code to hand, but it was something like:


BufferedImage img = new BufferedImage(??);
DataBufferUShort db = (DataBufferUShort)img.getRaster().getDataBuffer();
short buffer[] = db.getData();

I wonder how this compares to the above.
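
For comparison, here is a rough sketch of the integer-buffer variant. It assumes both images are TYPE_INT_ARGB and were created directly with new BufferedImage(w, h, type), so that the scanline stride equals the width. It is not the code described above, just one way the idea could be applied to the scaling loop.

```java
import java.awt.image.BufferedImage;
import java.awt.image.DataBufferInt;

// Illustrative sketch: one int store per pixel instead of four byte stores.
final class PackedPixelSketch {

    // Nearest-neighbour scale from src into dst.
    // Both images must be TYPE_INT_ARGB and must not be sub-images.
    static void scale(BufferedImage src, BufferedImage dst) {
        int[] s = ((DataBufferInt) src.getRaster().getDataBuffer()).getData();
        int[] d = ((DataBufferInt) dst.getRaster().getDataBuffer()).getData();
        int srcWidth = src.getWidth();
        int dstWidth = dst.getWidth();
        int dstHeight = dst.getHeight();
        // 16.16 fixed-point steps through the source image
        int xStep = (int) (((double) srcWidth / (double) dstWidth) * 65536);
        int yStep = (int) (((double) src.getHeight() / (double) dstHeight) * 65536);
        int adr = 0;  // write position in the destination array
        int yPos = 0; // 16.16 fixed-point y position in the source
        for (int y = 0; y < dstHeight; y++) {
            int rowStart = (yPos >> 16) * srcWidth;
            int xPos = 0;
            for (int x = 0; x < dstWidth; x++) {
                d[adr++] = s[rowStart + (xPos >> 16)]; // whole pixel at once
                xPos += xStep;
            }
            yPos += yStep;
        }
    }
}
```

Note that a packed ARGB int does not have the byte order GL_RGBA with GL_UNSIGNED_BYTE expects, so uploading this buffer would need a matching format and type, for example GL_BGRA with GL_UNSIGNED_INT_8_8_8_8_REV where the driver supports it (untested here).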