Enhanced loop in Tiger

I have just downloaded 1.5 and started playing with it.
I made a small test with the new enhanced loops. So far I had thought it was just a syntax improvement; I did not realize it can also help improve the performance of loops. Isn't that cool, or did I just wake up after everybody else?

Before 1.5, one would write (some stupid test) like this:

// array is int[]

            for (int i = 0, n = array.length; i < n; i++)
            {
                  int p = array[i];
                  p = p + p + p;
            }

In 1.5, it becomes:

            for (int p : array)
            {
                  p = p + p + p;
            }

There is no access penalty for the elements of the array, and my first tests show it is actually faster (yes, it is a micro benchmark, but it contains some audio processing code with tight loops gobbling most of the processing time). I wonder if the JVM can also optimize away the bounds checks, since it 'knows' that the whole array is going to be scanned?

Interesting. I guess in that case it is always guaranteed that the index is never out of bounds… I imagine the bytecode that is output is therefore exactly what HotSpot expects in order for it to do bounds check elimination. You should use javap to see how the bytecode differs between the two loop forms.

[quote] for (int i=0, n=array.length; i<n; i++)
[/quote]
Possibly your attempted optimisation of copying the array length to ‘n’ actually slows the loop down. Try using

for (int i=0; i<array.length; i++)

instead.

A quick test:


public class Test {
    private static int[] test = new int[10000];

    public static void main(String[] args) {
      // Init
      for (int i = 0; i < test.length; i++) {
          test[i] = i;
      }

      // Jit
      testOld();
      testOld();
      testNew();
      testNew();

      // Test old:
      long old = System.nanoTime();
      
      long total = 0;
      for (int i = 0; i < 10; i++) {
          total += testOld();
      }
      
      long now = System.nanoTime();
      
      System.out.printf("Old took %.3f ms and produced %d\n", ((now - old)/1000000.0f), total);
      
      // Test new:
      old = System.nanoTime();
      
      total = 0;
      for (int i = 0; i < 10; i++) {
          total += testNew();
      }
      
      now = System.nanoTime();
      
      System.out.printf("New took %.3f ms and produced %d\n", ((now - old)/1000000.0f), total);
    }

    private static int testOld() {
      int sum = 0;
      for (int i = 0; i < test.length; i++) {
          sum += test[i];
      }
      return sum;
    }

    private static int testNew() {
      int sum = 0;
      for (int i : test) {
          sum += i;
      }
      return sum;
    }
}

…gives:


private static int testOld();
  Code:
   0:      iconst_0
   1:      istore_0
   2:      iconst_0
   3:      istore_1
   4:      iload_1
   5:      getstatic      #2; //Field test:[I
   8:      arraylength
   9:      if_icmpge      26
   12:      iload_0
   13:      getstatic      #2; //Field test:[I
   16:      iload_1
   17:      iaload
   18:      iadd
   19:      istore_0
   20:      iinc      1, 1
   23:      goto      4
   26:      iload_0
   27:      ireturn

private static int testNew();
  Code:
   0:      iconst_0
   1:      istore_0
   2:      getstatic      #2; //Field test:[I
   5:      astore_1
   6:      aload_1
   7:      arraylength
   8:      istore_2
   9:      iconst_0
   10:      istore_3
   11:      iload_3
   12:      iload_2
   13:      if_icmpge      32
   16:      aload_1
   17:      iload_3
   18:      iaload
   19:      istore      4
   21:      iload_0
   22:      iload      4
   24:      iadd
   25:      istore_0
   26:      iinc      3, 1
   29:      goto      11
   32:      iload_0
   33:      ireturn


Old took 1.008 ms and produced 499950000
New took 0.675 ms and produced 499950000

Yay - very nice :slight_smile:

(


Old took 20.450 ms and produced 7049827040
New took 23.482 ms and produced 7049827040

If the array is ten times bigger :frowning:
)

Interesting that the new for construct doesn't generate the optimal loop:


      int sum = 0;
      int [] array = test;
      int length = array.length;
      for(int i = 0; i < length; i++)
      {
         sum += array[i];
      }
      return sum;

it generates:


      int sum = 0;
      int [] array = test;
      int length = array.length;
      for(int i = 0; i < length; i++)
      {
         int j = array[i];
         sum+=j;
      }
      return sum;

This is bad for two reasons :-/

  1. more bytecode
  2. it uses more than 4 local variables.

I would not worry about the number of local variables - the JIT will optimize them away anyway.

I also don't see a real problem with the bytecode size - unless you are coding for J2ME, in which case foreach is out of reach anyway :wink:

But seeing as it's automatically generating the code, what is the reason for it NOT to generate the optimal loop? :-/

Keep in mind that even before 1.5 there should be no penalty for the array access. This is because array[i] has absolutely predictable bounds.

Such a small microbenchmark has lots of places it can go wrong. Keep in mind the on-stack replacement (OSR) issue and other issues that often make microbenchmarks like this more or less useless…

JK

Btw… just the fact that the relationship changes with much bigger loop sizes suggests a warm-up problem in the benchmark.
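A sketch of a warm-up-aware harness (class and method names are made up for illustration). Extracting the timed work into its own method lets the JIT compile the whole method after a few calls, sidestepping the on-stack-replacement issue that bites benchmarks doing everything inside main():

```java
public class WarmupBench {
    static final int[] DATA = new int[100000];
    static {
        for (int i = 0; i < DATA.length; i++) DATA[i] = i;
    }

    // The timed work lives in its own method so HotSpot can compile it whole.
    static long sum() {
        long s = 0;
        for (int v : DATA) s += v;
        return s;
    }

    static long timedRun(int reps) {
        long start = System.nanoTime();
        long total = 0;
        for (int i = 0; i < reps; i++) total += sum();
        System.out.printf("reps=%d total=%d took %.3f ms%n",
                reps, total, (System.nanoTime() - start) / 1e6);
        return total;
    }

    public static void main(String[] args) {
        // Discard the first rounds: they include interpretation and JIT compilation.
        for (int i = 0; i < 5; i++) timedRun(100);
        timedRun(100); // only trust the numbers once successive runs stabilize
    }
}
```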

The mere fact that the two code versions actually differ is already food for thought IMO (that was not obvious from the documentation I read about 1.5).

@Mark,
I do not think so; this syntax is used in lots of places in the Java source files. At worst it does not optimize anything, but I would be surprised if it slowed down the test.

@Anders,
Did you wait for the warm-up to finish, as Jeff suggested?

@Jeff,
This kind of micro benchmark can easily go wrong, but they are interesting when they mimic the hot spots of your program (small loops called zillions of times to process audio/video data, for instance).

Are you saying that the HotSpot JVM is 'smart' enough to remove array bounds checks even before 1.5? (I thought bounds checks could only be removed via a flag.) That means it is able to determine that the index is not modified inside the loop body and that the increment is constant (maybe it must be 1 in order to remove the bounds checks). If so, then it is not obvious there is any gain at all; otherwise, the loops with the new syntax should be faster.
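For reference, this is the canonical counted-loop shape usually cited as eligible for bounds check elimination (class and method names are made up for illustration): the index runs from 0 up to a.length with a stride of 1, and neither the index nor the array reference is reassigned in the body, so every access is provably in bounds.

```java
public class BoundsCheckDemo {
    // 0 <= i < a.length, stride 1, 'a' and 'i' never reassigned in the body,
    // so a[i] can be proven in bounds and the per-element check dropped.
    static int sum(int[] a) {
        int s = 0;
        for (int i = 0; i < a.length; i++) {
            s += a[i];
        }
        return s;
    }
}
```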

The server JVM in 1.4 is supposed to be able to remove array bounds checks. The client VM does not.

[quote]The server JVM in 1.4 is supposed to be able to remove array bounds checks. The client VM does not.
[/quote]
Where did you get this information?

I think the only thing to compare in these cases is the bytecode that is generated. If the bytecode from the new method is larger, that is bad, because it will be less likely to be inlined. I don't see the relevance of the number of local variables; the only thing that seems relevant is whether they will fit in registers, but of course that is only an issue AFTER compiling to native code… so it could ultimately be the same code that HotSpot generates, if some locals can be eliminated when the bytecode is compiled.

As Jeff says, the test run was initially far too short.

Java bytecode has special compact instructions for storing and loading the first four local variables (a one-byte instruction).
The example above uses five local variables, so one of the locals is being accessed using the less compact, generic instruction (a one-byte instruction plus a one-byte argument [hmm, or is it a word?]).

I'm sure it makes absolutely no difference once compiled to native code (if it did, I would be very worried :P).
I was just pointing out that an automatically generated loop is less than optimal, which seems illogical to me.

Check the performance of System.arraycopy (which is native code) versus your own pure Java array copy code and see how it fares.

If it’s properly compiled and optimised we should see the difference tending towards 0.

Cas :slight_smile:
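A rough harness for that comparison might look like this (class and method names are made up for illustration; the warm-up caveats discussed earlier in the thread apply to any numbers it prints):

```java
public class CopyBench {
    // Hand-written element-by-element copy loop.
    static int[] manualCopy(int[] src) {
        int[] dst = new int[src.length];
        for (int i = 0; i < src.length; i++) dst[i] = src[i];
        return dst;
    }

    // The native intrinsic the post suggests comparing against.
    static int[] nativeCopy(int[] src) {
        int[] dst = new int[src.length];
        System.arraycopy(src, 0, dst, 0, src.length);
        return dst;
    }

    public static void main(String[] args) {
        int[] src = new int[1000000];
        for (int i = 0; i < src.length; i++) src[i] = i;

        // Warm up both paths before timing anything.
        for (int i = 0; i < 50; i++) { manualCopy(src); nativeCopy(src); }

        long t = System.nanoTime();
        for (int i = 0; i < 100; i++) manualCopy(src);
        System.out.printf("manual loop:      %.3f ms%n", (System.nanoTime() - t) / 1e6);

        t = System.nanoTime();
        for (int i = 0; i < 100; i++) nativeCopy(src);
        System.out.printf("System.arraycopy: %.3f ms%n", (System.nanoTime() - t) / 1e6);
    }
}
```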

Well, now - the above was just the largest difference the new for loop could make to execution speed. Smart bytecode versus JITting :slight_smile:

When main is extracted to a separate static method (letting the JIT compile it) and run in a loop from main until it stabilizes, the following numbers appear (with an array a factor of 100 bigger than the initial run - though size doesn't matter once things stabilize, even in the 10000 case):

java [1.5beta1]:


Old took 224.790 ms and produced 17832936640
New took 218.237 ms and produced 17832936640
Old took 227.440 ms and produced 17832936640
New took 212.157 ms and produced 17832936640

and so on…

java -server:


Old took 92.369 ms and produced 17832936640
New took 95.806 ms and produced 17832936640
Old took 97.163 ms and produced 17832936640
New took 93.987 ms and produced 17832936640

java -Xint:


Old took 1905.819 ms and produced 17832936640
New took 1483.741 ms and produced 17832936640

The new for loop is approx. 5-10% faster (i.e. not much!) when run under the client VM; on the server VM the bounds checks are eliminated, making the comparison void. Interesting, though, that the server version is more than 20x faster… :slight_smile:

http://java.sun.com/j2se/1.4/performance.guide.html

I knew I’d seen it somewhere, but it took a while to find again.

What do you mean by 'making the comparison void'? It looks to me like there is a significant speed-up using the new method on the server VM. The new method is > 20% faster according to the numbers above.

Interesting. But slightly incomplete… we have to assume so much. That page states that array bounds check elimination was added to the server VM in 1.4 (presumably 1.4.0). But I haven't seen anything that states which optimizations are done in the current client VM. From 1.4.1 through to 1.5 beta 1 (on which these tests were run) there could be additional optimizations in the client VM. It would be great to have a definitive answer as to what the precise differences between the client VM and server VM are in the later releases.

http://java.sun.com/docs/performance/
http://java.sun.com/j2se/1.4.2/1.4.2_whitepaper.html

Nothing for 1.4.1 or 1.5 (yet).

Nah, those are the interpreted results you have there :slight_smile:

In the server version the difference is +/- < 5% (i.e. alternating and within the error margin).