Hmm. Can this little piece be optimized further?

barfy · June 23, 2004, 8:00am

// pixel is a byte array
// currentTextureMap is a byte array

int newpix = (pixel[index] & 0xFF) + (currentTextureMap[tIndex] & 0xFF); 
                               
if(newpix > 255)
     newpix = 255;
                               
pixel[index] = (byte) newpix;

Can this section of code be made any faster? This is pretty crucial since this portion gets executed many times per frame.

I was wondering if there’s a way to limit the variable newpix (using bit operations?) so that it is always between the values 0-255 (if the value is greater than 255, than it should truncate to 255). Then I could simply discard that if statement which checks for overflow.

Herkules · June 23, 2004, 8:41am

int newpix = (pixel[index] + currentTextureMap[tIndex]) & 0xFF;

Hows that?

erikd · June 23, 2004, 8:43am

As far as I know, it can’t be replaced by a simple bit operation but the if statement is very cheap anyway. The only possible speed optimization I can see is to have an unsigned byte type in java (am I right? you would be able to skip the masking when converting to an int), but that doesn’t help you much does it?

erikd · June 23, 2004, 8:45am

I don’t think the result will be same, or will they?

Herkules · June 23, 2004, 8:52am

pixel[index] = (byte)(pixel[index] + currentTextureMap[tIndex]);

This should do it in one step.

barfy · June 23, 2004, 8:53am

Nope, that gives a different result. For example, if pixel[index] = -1 and currentTextureMap[tIndex] = -1,

Then the result in my original code would be 510, which is then truncated by that if statement to 255.
Your version would give 254.

Sigh, what I would really need is an unsigned byte type. But thanks anyway.

Herkules · June 23, 2004, 8:54am

pixel[index] & 0xFF

this doesn’t do anything bc. pixel[i] already is a byte! Just a waste of cycles.

barfy · June 23, 2004, 8:57am

Actually pixel[index] is a SIGNED byte, so calling pixel[index] & 0xFF converts it to the equivalent of an UNSIGNED byte (actually cast to an int type).

However, merely casting pixel[index] to an int type using ipixel[index][/i] would not change the sign of the byte.

Herkules · June 23, 2004, 9:01am

ic, you’re right … &0xff implicitely casts to int though…

erikd · June 23, 2004, 9:03am

Therefor we need to bug Sun to reconsider http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4186775

barfy · June 23, 2004, 9:09am

[quote]As far as I know, it can’t be replaced by a simple bit operation but the if statement is very cheap anyway.
[/quote]
Not so cheap when it’s executed more than 10000x per frame. I ran a profiler and this section is taking up 70% of my processor time.

Even just 1 or 2 less instructions in the code makes a substantial difference.

crystalsquid · June 23, 2004, 9:14am

You are right that you want to dispose of the branch if possible (conditional branches are nasty for most CPU’s), so to get round it you do:


newpix |= ((255-newpix)>>31);

If newpix < 255, then 255-newpix is positive. the shift right 31 will propogate the sign bit through the whole int, giving ‘0’. ORing this in gives no change.

If newpix >=255, the result of this line is ‘0xffffffff’, and when OR’d in makes newpix 0xffffffff as well, so when you cast back to a byte you will get ‘0xff’ - the clamped value you were after.

This takes 3 logical ops (3 cycles), compared to a branch predict error (~30 cycles) whenever the clamping is used. So if you clamp ~10% of pixels this way, the chances are that it will be around the same speed

Your original looks a little odd though - are your arrays really byte arrays? Are you not dealling with multiple colour channels at least? If you are tring to do additive or alpha blending, there are slightly more efficient ways to do this as you can ususally get away with dealling with R and B combined in one int, saving you 1/3rd of the work for a blend.

Hope this helps,

Dom

mthornton · June 23, 2004, 9:39am

[quote]This takes 3 logical ops (3 cycles)
[/quote]
Could be a lot more on some CPU that don’t have a barrel shifter (Pentium 4?).

barfy · June 23, 2004, 9:57am

[quote]You are right that you want to dispose of the branch if possible (conditional branches are nasty for most CPU’s), so to get round it you do:


newpix |= ((255-newpix)>>31);

If newpix < 255, then 255-newpix is positive. the shift right 31 will propogate the sign bit through the whole int, giving ‘0’. ORing this in gives no change.

If newpix >=255, the result of this line is ‘0xffffffff’, and when OR’d in makes newpix 0xffffffff as well, so when you cast back to a byte you will get ‘0xff’ - the clamped value you were after.
[/quote]
Thanks. Unfortunately, it’s slower now :D. Although that’s really quite an elegant way of skirting the conditional statement.

I think the issue is not so much a branch prediction error/cache miss, but the sheer number of instructions that gets executed per frame.

EDIT: I’m using a p4 so part of the slowdown could probably be with the issue described by Mark in his post above.

I’m actually working with 8-bit IndexColorModels and a DataBuffer.Byte pixel array. The addition that you see is just adding corresponding pixel values from a pre-defined 8-bit texture map to the DataBuffer.Byte pixel array.

Hmm. What you suggested got me thinking though. I wonder if I could use a DataBuffer.Int with the IndexColorModel so that 4 8-bit pixel values can be combined in an int, and then perform the adding on the int instead…

Thanks

Herkules · June 23, 2004, 10:33am

DataBuffer.MMX would be helpful

phazer · June 23, 2004, 10:42am

You could also try this:


// a[i] = i for i < 256 and a[i] = 255 for 256 <= i <512

pixel[index] = a[newpix];

Don’t know if there will be a speed increase, but it’s worth a shot. The array is so small it will probably fit inside the L1 cache.

erikd · June 23, 2004, 12:41pm

[quote]What you suggested got me thinking though. I wonder if I could use a DataBuffer.Int with the IndexColorModel so that 4 8-bit pixel values can be combined in an int, and then perform the adding on the int instead…
[/quote]
I’m guessing you would have to do an awful lot of masking instead to prevent overflows to ‘bleed’ into the wrong bits, or am I missing something?

tom · June 23, 2004, 1:09pm

Here is some code that adds the rgb components of a int using the same method as crystalsquid. There is some loss in precision, and the 4th component needs to be handled seperatly

      public final static int addSaturated(int a, int b)
      {
            a &= 0xfefefe;
            b &= 0xfefefe;
            int ab = a+b;
            int sign = ab & 0x01010100;
            int sum = (sign-(sign>>8)) | ab;
            return sum;
      }

barfy · June 23, 2004, 3:19pm

[quote]You could also try this:


// a[i] = i for i < 256 and a[i] = 255 for 256 <= i <512

pixel[index] = a[newpix];

Don’t know if there will be a speed increase, but it’s worth a shot. The array is so small it will probably fit inside the L1 cache.
[/quote]
That gives roughly the same, maybe a little slower speeds than with the “if” statement. Probably because there’s the array bounds check with each random access.

erikd · June 23, 2004, 3:21pm

Overflows are now going to the wrong color and even to the 4th byte (i.e. addSaturated(0xff00ff, 0xff00ff) results in 0x1ff01ff) so the result can be slightly wrong. Maybe this isn’t a problem though, but then again maybe it is…