Hmm. Can this little piece be optimized further?

// pixel is a byte array
// currentTextureMap is a byte array

int newpix = (pixel[index] & 0xFF) + (currentTextureMap[tIndex] & 0xFF); 
                               
if(newpix > 255)
     newpix = 255;
                               
pixel[index] = (byte) newpix;

Can this section of code be made any faster? This is pretty crucial since this portion gets executed many times per frame.

I was wondering if there’s a way to limit the variable newpix (using bit operations?) so that it is always between the values 0-255 (if the value is greater than 255, than it should truncate to 255). Then I could simply discard that if statement which checks for overflow.

int newpix = (pixel[index] + currentTextureMap[tIndex]) & 0xFF;  

Hows that?

As far as I know, it can’t be replaced by a simple bit operation but the if statement is very cheap anyway. The only possible speed optimization I can see is to have an unsigned byte type in java (am I right? you would be able to skip the masking when converting to an int), but that doesn’t help you much does it? :wink:

I don’t think the result will be same, or will they?

pixel[index] = (byte)(pixel[index] + currentTextureMap[tIndex]);    

This should do it in one step.

Nope, that gives a different result. For example, if pixel[index] = -1 and currentTextureMap[tIndex] = -1,

  1. Then the result in my original code would be 510, which is then truncated by that if statement to 255.

  2. Your version would give 254.

Sigh, what I would really need is an unsigned byte type. But thanks anyway.

pixel[index] & 0xFF

this doesn’t do anything bc. pixel[i] already is a byte! Just a waste of cycles.

Actually pixel[index] is a SIGNED byte, so calling pixel[index] & 0xFF converts it to the equivalent of an UNSIGNED byte (actually cast to an int type).

However, merely casting pixel[index] to an int type using ipixel[index][/i] would not change the sign of the byte.

ic, you’re right … &0xff implicitely casts to int though…

Therefor we need to bug Sun to reconsider http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4186775

[quote]As far as I know, it can’t be replaced by a simple bit operation but the if statement is very cheap anyway.
[/quote]
Not so cheap when it’s executed more than 10000x per frame. I ran a profiler and this section is taking up 70% of my processor time.

Even just 1 or 2 less instructions in the code makes a substantial difference.

You are right that you want to dispose of the branch if possible (conditional branches are nasty for most CPU’s), so to get round it you do:


newpix |= ((255-newpix)>>31);

If newpix < 255, then 255-newpix is positive. the shift right 31 will propogate the sign bit through the whole int, giving ‘0’. ORing this in gives no change.

If newpix >=255, the result of this line is ‘0xffffffff’, and when OR’d in makes newpix 0xffffffff as well, so when you cast back to a byte you will get ‘0xff’ - the clamped value you were after.

This takes 3 logical ops (3 cycles), compared to a branch predict error (~30 cycles) whenever the clamping is used. So if you clamp ~10% of pixels this way, the chances are that it will be around the same speed :slight_smile:

Your original looks a little odd though - are your arrays really byte arrays? Are you not dealling with multiple colour channels at least? If you are tring to do additive or alpha blending, there are slightly more efficient ways to do this as you can ususally get away with dealling with R and B combined in one int, saving you 1/3rd of the work for a blend.

Hope this helps,

  • Dom

[quote]This takes 3 logical ops (3 cycles)
[/quote]
Could be a lot more on some CPU that don’t have a barrel shifter (Pentium 4?).

[quote]You are right that you want to dispose of the branch if possible (conditional branches are nasty for most CPU’s), so to get round it you do:


newpix |= ((255-newpix)>>31);

If newpix < 255, then 255-newpix is positive. the shift right 31 will propogate the sign bit through the whole int, giving ‘0’. ORing this in gives no change.

If newpix >=255, the result of this line is ‘0xffffffff’, and when OR’d in makes newpix 0xffffffff as well, so when you cast back to a byte you will get ‘0xff’ - the clamped value you were after.
[/quote]
Thanks. Unfortunately, it’s slower now :D. Although that’s really quite an elegant way of skirting the conditional statement.

I think the issue is not so much a branch prediction error/cache miss, but the sheer number of instructions that gets executed per frame.

EDIT: I’m using a p4 so part of the slowdown could probably be with the issue described by Mark in his post above.

I’m actually working with 8-bit IndexColorModels and a DataBuffer.Byte pixel array. The addition that you see is just adding corresponding pixel values from a pre-defined 8-bit texture map to the DataBuffer.Byte pixel array.

Hmm. What you suggested got me thinking though. I wonder if I could use a DataBuffer.Int with the IndexColorModel so that 4 8-bit pixel values can be combined in an int, and then perform the adding on the int instead…

Thanks :slight_smile:

DataBuffer.MMX would be helpful :slight_smile:

You could also try this:


// a[i] = i for i < 256 and a[i] = 255 for 256 <= i <512

pixel[index] = a[newpix];

Don’t know if there will be a speed increase, but it’s worth a shot. The array is so small it will probably fit inside the L1 cache.

[quote]What you suggested got me thinking though. I wonder if I could use a DataBuffer.Int with the IndexColorModel so that 4 8-bit pixel values can be combined in an int, and then perform the adding on the int instead…
[/quote]
I’m guessing you would have to do an awful lot of masking instead to prevent overflows to ‘bleed’ into the wrong bits, or am I missing something?

Here is some code that adds the rgb components of a int using the same method as crystalsquid. There is some loss in precision, and the 4th component needs to be handled seperatly :frowning:

      public final static int addSaturated(int a, int b)
      {
            a &= 0xfefefe;
            b &= 0xfefefe;
            int ab = a+b;
            int sign = ab & 0x01010100;
            int sum = (sign-(sign>>8)) | ab;
            return sum;
      }

[quote]You could also try this:


// a[i] = i for i < 256 and a[i] = 255 for 256 <= i <512

pixel[index] = a[newpix];

Don’t know if there will be a speed increase, but it’s worth a shot. The array is so small it will probably fit inside the L1 cache.
[/quote]
That gives roughly the same, maybe a little slower speeds than with the “if” statement. Probably because there’s the array bounds check with each random access.

Overflows are now going to the wrong color and even to the 4th byte (i.e. addSaturated(0xff00ff, 0xff00ff) results in 0x1ff01ff) so the result can be slightly wrong. Maybe this isn’t a problem though, but then again maybe it is…