Code optimization article

I recently came across this article:
http://www.azillionmonkeys.com/qed/optimize.html

The article (although a bit old) was written with games in mind, is geared towards C and ASM, and also deals with things like branch prediction.

I tried the ‘else’ removal technique explained in the article on an inner loop in a real-life application, but with no success.
I’m wondering to what degree some of these (micro-)optimization techniques (such as ‘else’ removal) are applicable to Java programs (or to today’s CPUs in general)?

The micro-optimization techniques aside, it’s an interesting read.

A guess would be that while C compilers rely on a fixed rule about branches (“the else branch is rarely taken”), HotSpot is more than able to insert profiling code and simply measure which branch to optimize for.

  • elias

The ‘else’ removal technique really depends upon the code in the if and else blocks being very small and very easy to undo. It also depends upon a branch prediction miss being likely and very expensive. You’re best off worrying about branch prediction within tight for loops, and even then the processor will probably only mispredict a few times out of all of the loop iterations. You’re also increasing your code size, which works against the instruction cache.
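
If I understand the ‘else’ removal idea correctly, it boils down to replacing the branch with arithmetic so there is nothing left to mispredict. A minimal sketch of the shape of it (my own made-up example in Java, not code from the article):


    // Branchy version: the CPU has to predict which way the comparison goes.
    static int clampNegativeToZeroBranchy(int x)
    {
        if (x < 0)
            return 0;
        else
            return x;
    }

    // 'else' removal: derive an all-ones/all-zeros mask from the sign bit and
    // blend arithmetically, so there is no branch left to mispredict.
    static int clampNegativeToZeroBranchless(int x)
    {
        int mask = x >> 31;   // 0 when x >= 0, -1 (all bits set) when x < 0
        return x & ~mask;     // x when x is non-negative, 0 otherwise
    }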

In general, I think we’re moving away from the trend of branch mispredictions getting ever costlier, because we’re moving towards parallel programming - larger numbers of simpler cores combined with effective use of SIMD instructions. Cas made a statement a couple of years back about multi-threaded pipelines being pointless, but the reality is that anyone buying a new machine now is guaranteed to be getting at least a dual-core setup, and in less than a year it’s going to be quad core.

Yep, it’s no longer entirely pointless to think about multicore designs, though the vast majority of users out there are going to be on single-core machines for the next 4-5 years. In the meantime it’s nice to know that the garbage collector and the HotSpot server compiler can run quite effectively in their own threads now.

Cas :slight_smile:

Nice find!


   private static float simpleSum(float[] arr)
   {
      float sum = 0.0f;
      for (int i = 0; i < arr.length; i++)
         sum += arr[i];
      return sum;
   }

   private static float advancedSum(float[] arr)
   {
      int i = 0;

      // Four independent accumulators: the floating-point adds no longer form one
      // long dependency chain, so they can overlap (note this also changes the
      // order of the additions compared with a straight left-to-right sum).
      float sumA = 0.0f;
      float sumB = 0.0f;
      float sumC = 0.0f;
      float sumD = 0.0f;

      int end = arr.length;

      end -= 3; // unrolled by 4, so stop 3 short of the end

      while (i < end)
      {
         sumA += arr[i + 0];
         sumB += arr[i + 1];
         sumC += arr[i + 2];
         sumD += arr[i + 3];

         i += 4;
      }

      end += 3;

      // Combine the partial sums and pick up the 0-3 leftover elements.
      float sum = (sumA + sumB) + (sumC + sumD);
      while (i < end)
         sum += arr[i++];

      return sum;
   }

simpleSum 6705ns (149K sums/s)
advancedSum 3911ns (255K sums/s)

71% faster!

My tests show this one is 25% faster than advancedSum:


    private static float fastAdvancedSum(float[] arr) {
        int i = 0;
        
        float sum = 0.0f;
        
        int end = arr.length;
        
        end -= 3;
        
        while (i < end) {
            sum += arr[i++];
            sum += arr[i++];
            sum += arr[i++];
            sum += arr[i++];
        }
        
        end += 3;
        
        while (i < end) {
            sum += arr[i++];
        }

        return sum;
    }

and 35% if you do it this way:


    private static float advancedSum(float[] arr) {
        int i = 0;
        
        float sum = 0.0f;
        
        int end = arr.length;
        
        end -= 7; // unrolled by 8, so stop 7 short of the end
        
        while (i < end) {
            sum += arr[i++];
            sum += arr[i++];
            sum += arr[i++];
            sum += arr[i++];
            sum += arr[i++];
            sum += arr[i++];
            sum += arr[i++];
            sum += arr[i++];
        }
        
        end += 7;
        
        while (i < end) {
            sum += arr[i++];
        }

        return sum;
    }

That does rather strike me as being silly :slight_smile: The HotSpot server compiler should probably be doing those kinds of optimisation itself, as they don’t half make the actual source code much more difficult to understand!

Cas :slight_smile:

The Server VM (and Client VM sometimes) is driving me nuts…

I’m putting the contents of a float[] into a strided data structure with a certain offset and width (like interleaved VBO stuff).

One element can look like: [x,x,X,X,X,X,x,x,x] (the pattern repeats itself) - the uppercase X’s being the part that gets written.

The code for this is:


   public final void putElements(int elementOffset, int elementLength, float[] array, int arrayOffset)
   {
      int dataOffset = this.dataOffsetOfElement(elementOffset);
      int lastDataOffset = dataOffset + elementLength * stride;

      while (dataOffset < lastDataOffset)
      {
         data[dataOffset + 0] = array[arrayOffset + 0];
         data[dataOffset + 1] = array[arrayOffset + 1];
         data[dataOffset + 2] = array[arrayOffset + 2];
         data[dataOffset + 3] = array[arrayOffset + 3];
         arrayOffset += width; // if I replace 'width' (which is 4) with '4', performance gets cut in half
         dataOffset += stride;
      }
   }

Now the silly part is that it gets executed much faster with a switch statement around it:


      switch (width)
      {
         case 1:
            // this is the madness
            dataOffset += stride * elementLength;
            for (int e = elementLength - 1; e >= 0; e--)
               data[dataOffset -= stride] = array[arrayOffset + e];
            break;

         case 2:
            // code optimized for width of 2
            break;

         case 3:
            // code optimized for width of 3
            break;

         case 4: // running the code for width '4' inside the switch-statement is 21% faster than specialized method for width 4!! arg!
            while (dataOffset < lastDataOffset)
            {
               data[dataOffset + 0] = array[arrayOffset + 0];
               data[dataOffset + 1] = array[arrayOffset + 1];
               data[dataOffset + 2] = array[arrayOffset + 2];
               data[dataOffset + 3] = array[arrayOffset + 3];
               arrayOffset += width;
               dataOffset += stride;
            }
            break;

         case 5:
            // code optimized for width of 5
            break;

         default:
            // very very slow generic case
            for (int e = elementLength - 1; e >= 0; e--)
            {
               for (int i = width - 1; i >= 0; i--)
                  data[dataOffset + i] = array[arrayOffset + i];
               arrayOffset += width;
               dataOffset += stride;
            }
            break;
      }
   }

Making tiny changes, like replacing ‘width’ with ‘4’, makes the HotSpot VM drop everything and generate some slow execution path (maybe it requires an additional register and it just ran out? Should that cause a 50% performance loss?).

When I code for performance, I have to find the fastest way by trial and error. The difference can be up to a factor of 3 after applying lots of seemingly irrelevant changes, or surrounding the loop with code that never executes, just to make it 21% faster. And even then it will only be the fastest loop on that version of that vendor’s VM.

I’m going to implement this tiny loop in C and build a DLL; that way I’m certain the code is compiled properly regardless of the VM. It’s such a shame HotSpot isn’t predictable.
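
Something like this on the Java side, just to show the shape of it (the class and library names are made up; the actual copy loop would live in the compiled C library):


    // Hypothetical Java half of the native route: the hot loop is declared as a
    // native method and its implementation is compiled separately into a DLL / .so.
    public class NativeStride
    {
        static
        {
            System.loadLibrary("stride"); // expects stride.dll or libstride.so on the library path
        }

        // Copies 'width' floats per element from 'array' into 'data', advancing
        // 'dataOffset' by 'stride' per element; implemented in C.
        public static native void putElements(float[] data, int dataOffset, int stride,
                                              float[] array, int arrayOffset, int width,
                                              int elementLength);
    }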

Could be the warm-up thing. I didn’t let the VM run each of the examples for long enough to give it time to optimize.

Using this code:


import java.util.Random;

public class Test {
    public static void main(String[] args) {
        float nums[] = new float[2000];
        long time1, time2, time3, time4;
        float sum1, sum2, sum3, sum4;
        
        Random random = new Random(1);
        for(int i=0;i<2000;i++) {
            nums[i] = random.nextFloat() * 100.0f;
        }
        
        long time = System.nanoTime();
        long tick = time;
        int seconds = 0;
        while(System.nanoTime() - time < 300000000000L) {
            simpleSum(nums);
            advancedSum(nums);
            fixAdvancedSum(nums); // warm this one up too, so it also gets compiled before being timed
            fastAdvancedSum(nums);
            if(System.nanoTime() - tick > 10000000000L) {
                tick = System.nanoTime();
                seconds += 10;
                System.out.println(seconds);
            }
        }
        
        time1 = System.nanoTime();
        sum1 = simpleSum(nums);
        time1 = System.nanoTime() - time1;
        time2 = System.nanoTime();
        sum2 = advancedSum(nums);
        time2 = System.nanoTime() - time2;
        time3 = System.nanoTime();
        sum3 = fixAdvancedSum(nums);
        time3 = System.nanoTime() - time3;
        time4 = System.nanoTime();
        sum4 = fastAdvancedSum(nums);
        time4 = System.nanoTime() - time4;
        
        System.out.println("After a 5 minute warm up the execution times are\r\n");
        System.out.println("      simpleSum " + sum1 + " took " + time1 + " nanoseconds");
        System.out.println("    advancedSum " + sum2 + " took " + time2 + " nanoseconds " + ((1.0 - ((double)time2 / (double)time1)) * 100.0) + "% faster");
        System.out.println(" fixAdvancedSum " + sum3 + " took " + time3 + " nanoseconds " + ((1.0 - ((double)time3 / (double)time1)) * 100.0) + "% faster");
        System.out.println("fastAdvancedSum " + sum4 + " took " + time4 + " nanoseconds " + ((1.0 - ((double)time4 / (double)time1)) * 100.0) + "% faster");
    }

    private static float simpleSum(float[] arr) {
        float sum = 0.0f;
        for (int i = 0; i < arr.length; i++) {
            sum += arr[i];
        }

        return sum;
    }

    private static float advancedSum(float[] arr) {
        int i = 0;

        float sumA = 0.0f;
        float sumB = 0.0f;
        float sumC = 0.0f;
        float sumD = 0.0f;

        int end = arr.length;

        end -= 3;

        while (i < end) {
            sumA += arr[i + 0];
            sumB += arr[i + 1];
            sumC += arr[i + 2];
            sumD += arr[i + 3];

            i += 4;
        }

        end += 3;

        float sum = (sumA + sumB) + (sumC + sumD);
        while (i < end) {
            sum += arr[i++];
        }

        return sum;
    }
    private static float fixAdvancedSum(float[] arr) {
        int i = 0;

        float sum = 0.0f;

        int end = arr.length;

        end -= 3;

        while (i < end) {
            sum += arr[i + 0];
            sum += arr[i + 1];
            sum += arr[i + 2];
            sum += arr[i + 3];

            i += 4;
        }

        end += 3;

        while (i < end) {
            sum += arr[i++];
        }

        return sum;
    }
    private static float fastAdvancedSum(float[] arr) {
        int i = 0;
        
        float sum = 0.0f;
        
        int end = arr.length;
        
        end -= 3;
        
        while (i < end) {
            sum += arr[i++];
            sum += arr[i++];
            sum += arr[i++];
            sum += arr[i++];
        }
        
        end += 3;
        
        while (i < end) {
            sum += arr[i++];
        }

        return sum;
    }
}

I get the following results:


Client VM
      simpleSum 99848.016 took 27099 nanoseconds
    advancedSum 99848.05 took 13968 nanoseconds 48.45566257057457% faster
 fixAdvancedSum 99848.016 took 30172 nanoseconds -11.339901841396372% faster
fastAdvancedSum 99848.016 took 29613 nanoseconds -9.27709509576% faster

Server VM
      simpleSum 99848.016 took 69003 nanoseconds
    advancedSum 99848.05 took 7543 nanoseconds 89.06859122067156% faster
 fixAdvancedSum 99848.016 took 8660 nanoseconds 87.44982102227439% faster
fastAdvancedSum 99848.016 took 7543 nanoseconds 89.06859122067156% faster

After a 5 minute warm up, the server VM really does its job: advancedSum and fastAdvancedSum run at the same speed. However, advancedSum reorders the floating-point additions and so never returns quite the same result as the others. fixAdvancedSum fixes that, but then runs a little slower.

Well, this particular optimization trick in ‘advancedSum’ changes the floating-point result (the additions are reordered), so HotSpot isn’t allowed to apply it on its own.
And besides, I can’t remember the last time I had to write a loop that sums the contents of an entire array and was also a performance bottleneck :slight_smile:
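
A tiny illustration of why that reordering matters (made-up numbers, nothing to do with the benchmark above): floating-point addition isn’t associative, so summing the same values in a different order can give a different answer, and Java requires the source order to be preserved.


    static void showReorderingEffect()
    {
        float x = 1.0e8f, y = -1.0e8f, z = 1.0f;

        // Same three values, two association orders:
        System.out.println((x + y) + z); // prints 1.0 : (0.0f) + 1.0f
        System.out.println(x + (y + z)); // prints 0.0 : y + z rounds back to -1.0e8f, so the 1.0f is lost
    }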

But sure, I suppose there’s still room in HotSpot for improvement.