The JIT learns by example, better teach it right!

I fed float data of various sizes into an algorithm and measured the performance.

Case 1:


for (int i = 0; i < 256; i++)
{
   process(new float[64]);
   process(new float[256]);
   process(new float[1024]);
   process(new float[4096]);
   process(new float[16384]);
}

Case 2:


for (int i = 0; i < 256; i++)
{
   process(new float[16384]);
   process(new float[4096]);
   process(new float[1024]);
   process(new float[256]);
   process(new float[64]);
}

How fast will these perform?
Case 1: all variations run at 2350 elements/sec
Case 2: all variations run at 850 elements/sec

The difference is significant!
I once had a 50% frame drop in my (then geometry-transform-bottlenecked) game after I changed something trivial that seemed to have nothing to do with it. So it's not so much about feeding the JIT real-world examples; it's trial and error: feed it whatever it likes best… and even then you only know how to treat that particular JIT.

To get the best performance with any JIT, you might launch a few sub-processes, see which version the JIT likes, then use that as a warmup for the JIT you're dealing with in your main process. This quickly spirals out of control when you're dealing with more than 3 performance-critical sections in an application.
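
For instance, a minimal sketch of that warmup trick (my own illustration; the process() and time() methods are hypothetical stand-ins, not the code from the experiment): prime the JIT on one input size per sub-process, then measure the rest.


import java.util.Arrays;

public class WarmupOrder
{
   static float sink; // keeps the JIT from discarding the results

   // hypothetical stand-in for the real workload
   static float process(float[] data)
   {
      float sum = 0.0f;
      for (int i = 0; i < data.length; i++)
         sum += data[i] * 0.5f;
      return sum;
   }

   static long time(int size, int iterations)
   {
      float[] data = new float[size];
      Arrays.fill(data, 1.0f);

      long start = System.nanoTime();
      for (int i = 0; i < iterations; i++)
         sink += process(data);
      return System.nanoTime() - start;
   }

   public static void main(String[] args)
   {
      // warm up on the size passed on the command line,
      // so each sub-process primes the JIT differently
      int warmupSize = Integer.parseInt(args[0]);
      time(warmupSize, 100 * 1000);

      // then measure all sizes with the profile the JIT settled on
      for (int size = 64; size <= 16384; size <<= 2)
         System.out.println("size=" + size + ": " + time(size, 10 * 1000) + " ns");
   }
}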

Or get rid of this JIT magic and go native, as that always performs best!

I hate to nit-pick, but the above is of course not true, as it ignores a JIT compiler's ability to adapt to the current runtime conditions.

Anyway, I’m curious about your experiment. Can you provide more info on the “process” method? I wonder for example if the adaptive nature of the JIT backfired by compiling the code in a manner that was more appropriate for smaller data sets before it got to the larger ones.

I really want to get a copy of a JVM that supports a fancy flag that spits out the machine code it generates.

True, but C/C++ vector code (pointers + SIMD) is always roughly a factor of 2.4 faster (on my machine).

The JIT didn't backfire: when the first pass(es) used small arrays, it kept performing better on larger and larger arrays/buffers.

I thought there IS a flag like that in Java 6.0 and was searching for it… some folks talked about it on this very forum! (don't you just love this community)

Aaah… found it:
-XX:+PrintOptoAssembly

Doesn't work… :slight_smile:


   // (snippet; requires: import java.nio.ByteBuffer, java.nio.ByteOrder, java.nio.FloatBuffer)
   public static final void main(String[] args)
   {
      int max = 9;

      int start, stop, step;

      // this controls the order
      boolean smallToLarge = true;

      if (smallToLarge)
      {
         start = 0;
         stop = max;
         step = 1;
         System.out.println("ORDER: SMALL[64] -> LARGE[16k]");
      }
      else
      {
         start = max - 1;
         stop = -1; // -1 (not 0) so the loop reaches i == 0 and size 64 is included, as in the output below
         step = -1;
         System.out.println("ORDER: LARGE[16k] -> SMALL[64]");
      }

      for (int i = start; (smallToLarge ? i < stop : i > stop); i += step)
      {
         int size = 1 << (i + 6);

         long elapsed = benchmark(size);

         System.out.println("size=" + size + ", \ttook: " + (elapsed/1000L) + "us \t(" + (elapsed / size) + " ns/element -> "+(size * 1000000000L / elapsed)+" elements/s)");

         for (int k = 0; k < 8; k++)
            System.gc();

         try
         {
            Thread.sleep(2 * 1000);
         }
         catch (Exception exc)
         {
            //
         }
      }
   }


   public static final long benchmark(int size)
   {
      int byteSize = size * 4;

      FloatBuffer bufferA = ByteBuffer.allocateDirect(byteSize).order(ByteOrder.nativeOrder()).asFloatBuffer();
      FloatBuffer bufferB = ByteBuffer.allocateDirect(byteSize).order(ByteOrder.nativeOrder()).asFloatBuffer();
      FloatBuffer bufferC = ByteBuffer.allocateDirect(byteSize).order(ByteOrder.nativeOrder()).asFloatBuffer();

      int runs = 32;
      int loops = 1024;

      int recordedRuns = 0;

      long recordedElapsed = 0L;

      for (int i = 0; i < runs; i++)
      {
         if (i == runs / 2 - 1)
         {
            // reset halfway, discarding the warmup runs from the recorded time
            recordedElapsed = 0L;
            recordedRuns = 0;
         }

         long start = System.nanoTime();
         for (int k = 0; k < loops; k++)
         {
            faddN_buffer(bufferA, bufferB, bufferC, size);
            faddN_buffer(bufferB, bufferC, bufferA, size);
            faddN_buffer(bufferC, bufferA, bufferB, size);
         }
         long elapsed = System.nanoTime() - start;
         recordedElapsed += elapsed;

         recordedRuns++;
      }

      return recordedElapsed / recordedRuns;
   }


   /**
    * method that controls the unrolling of the algorithm
    */

   private static final void faddN_buffer(FloatBuffer a, FloatBuffer b, FloatBuffer c, int n)
   {
      int n8 = n >> 3;
      int n4 = (n & 7) >> 2;
      int n1 = n & 3;

      int pA = 0;
      int pB = 0;
      int pC = 0;

      if (n8 != 0)
      {
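         // note: after this loop pA/pB/pC end up 8 short of the next
         // unprocessed element -- this is the "tiny bug" discussed further down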
         pA -= 8;
         pB -= 8;
         pC -= 8;
         for (int i = n8 - 1; i >= 0; i--)
            fadd8_buffer(a, b, c, pA += 8, pB += 8, pC += 8);
      }

      if (n4 != 0)
      {
         pA -= 4;
         pB -= 4;
         pC -= 4;
         for (int i = n4 - 1; i >= 0; i--)
            fadd4_buffer(a, b, c, pA += 4, pB += 4, pC += 4);
      }

      if (n1 != 0)
      {
         pA -= 1;
         pB -= 1;
         pC -= 1;
         for (int i = n1 - 1; i >= 0; i--)
            fadd1_buffer(a, b, c, pA += 1, pB += 1, pC += 1);
      }
   }

   /**
    * methods that manually unroll loops (much faster)
    */

   private static final void fadd8_buffer(FloatBuffer a, FloatBuffer b, FloatBuffer c, int pa, int pb, int pc)
   {
      c.put(pc + 0, a.get(pa + 0) + b.get(pb + 0));
      c.put(pc + 1, a.get(pa + 1) + b.get(pb + 1));
      c.put(pc + 2, a.get(pa + 2) + b.get(pb + 2));
      c.put(pc + 3, a.get(pa + 3) + b.get(pb + 3));

      c.put(pc + 4, a.get(pa + 4) + b.get(pb + 4));
      c.put(pc + 5, a.get(pa + 5) + b.get(pb + 5));
      c.put(pc + 6, a.get(pa + 6) + b.get(pb + 6));
      c.put(pc + 7, a.get(pa + 7) + b.get(pb + 7));
   }

   private static final void fadd4_buffer(FloatBuffer a, FloatBuffer b, FloatBuffer c, int pa, int pb, int pc)
   {
      c.put(pc + 0, a.get(pa + 0) + b.get(pb + 0));
      c.put(pc + 1, a.get(pa + 1) + b.get(pb + 1));
      c.put(pc + 2, a.get(pa + 2) + b.get(pb + 2));
      c.put(pc + 3, a.get(pa + 3) + b.get(pb + 3));
   }

   private static final void fadd1_buffer(FloatBuffer a, FloatBuffer b, FloatBuffer c, int pa, int pb, int pc)
   {
      c.put(pc, a.get(pa) + b.get(pb));
   }

[b]Java 6.0 Server VM[/b]
ORDER: SMALL[64] -> LARGE[16k]
size=64, took: 3850us (60157 ns/element -> 16622 elements/s)
size=128, took: 1515us (11839 ns/element -> 84461 elements/s)
size=256, took: 2638us (10305 ns/element -> 97038 elements/s)
size=512, took: 5657us (11050 ns/element -> 90494 elements/s)
size=1024, took: 12177us (11892 ns/element -> 84087 elements/s)
size=2048, took: 24268us (11849 ns/element -> 84390 elements/s)
size=4096, took: 48544us (11851 ns/element -> 84376 elements/s)
size=8192, took: 97251us (11871 ns/element -> 84234 elements/s)
size=16384, took: 200368us (12229 ns/element -> 81769 elements/s)

ORDER: LARGE[16k] -> SMALL[64]
size=16384, took: 678194us (41393 ns/element -> 24158 elements/s)
size=8192, took: 310363us (37886 ns/element -> 26394 elements/s)
size=4096, took: 149891us (36594 ns/element -> 27326 elements/s)
size=2048, took: 74897us (36570 ns/element -> 27344 elements/s)
size=1024, took: 37610us (36728 ns/element -> 27226 elements/s)
size=512, took: 18856us (36829 ns/element -> 27152 elements/s)
size=256, took: 9516us (37174 ns/element -> 26900 elements/s)
size=128, took: 4693us (36668 ns/element -> 27271 elements/s)
size=64, took: 2405us (37581 ns/element -> 26608 elements/s)

Note: performance difference: factor 3+


[b]Java 6.0 Client VM[/b]
ORDER: SMALL[64] -> LARGE[16k] (all 4.8k to 5.2k elements/s)
ORDER: LARGE[16k] -> SMALL[64] (all 5.3k to 5.4k elements/s)

Note: roughly 4x slower than Client 5.0 and 7x slower than Server 5.0!! :o (see below)


[b]Java 5.0 Server VM[/b]
ORDER: SMALL[64] -> LARGE[16k] (all 33k to 37k elements/s)
ORDER: LARGE[16k] -> SMALL[64] (all 33k to 37k elements/s)

[b]Java 5.0 Client VM[/b]
ORDER: SMALL[64] -> LARGE[16k] (all 19k elements/s)
ORDER: LARGE[16k] -> SMALL[64] (all 19k elements/s)

Note: server ~80% faster than client

So if I understand correctly, you've hit a Java 6 VM bug, and this is not an example of a general flaw of JIT technology.
Looks like a good test case for your bug report :slight_smile:

Well, I've had similar problems in other JITs, where the initial order of the workload made a significant difference in performance.

And of course there's all the code that has exactly the same functionality, with marginally reordered bytecode, that can cost or gain you 50% in performance.

When I need fast code these days, I write a number of permutations of the algorithm (tiny changes) and pick the fastest.
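
To make that concrete: a rough sketch of such a pick-the-fastest harness (my own illustration; the two variants are just placeholders). Note that, per the discussion above, variants warmed up in the same JVM can contaminate each other's profile, so for clean numbers each variant really belongs in its own sub-process.


import java.util.LinkedHashMap;
import java.util.Map;

public class PickFastest
{
   static final float[] data = new float[4096];

   static float sink; // keeps the JIT from discarding the results

   public static void main(String[] args)
   {
      // two placeholder permutations of the same algorithm
      Map<String, Runnable> variants = new LinkedHashMap<String, Runnable>();
      variants.put("tight", new Runnable()
      {
         public void run()
         {
            float sum = 0.0f;
            for (int i = 0; i < data.length; i++)
               sum += data[i];
            sink = sum;
         }
      });
      variants.put("unrolled(2)", new Runnable()
      {
         public void run()
         {
            float sum0 = 0.0f, sum1 = 0.0f;
            for (int i = 0; i < data.length; i += 2)
            {
               sum0 += data[i];
               sum1 += data[i + 1];
            }
            sink = sum0 + sum1;
         }
      });

      String best = null;
      long bestTime = Long.MAX_VALUE;

      for (Map.Entry<String, Runnable> e : variants.entrySet())
      {
         Runnable r = e.getValue();

         for (int i = 0; i < 10000; i++)
            r.run(); // warmup

         long start = System.nanoTime();
         for (int i = 0; i < 10000; i++)
            r.run();
         long elapsed = System.nanoTime() - start;

         System.out.println(e.getKey() + ": " + (elapsed / 1000L) + "us");

         if (elapsed < bestTime)
         {
            bestTime = elapsed;
            best = e.getKey();
         }
      }

      System.out.println("fastest: " + best);
   }
}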

For example, there is a tiny bug in the benchmark code.


         pA -= 8;
         pB -= 8;
         pC -= 8;
         for (int i = n8 - 1; i >= 0; i--)
            fadd8_buffer(a, b, c, pA += 8, pB += 8, pC += 8);

Should have been:


         pA -= 8;
         pB -= 8;
         pC -= 8;
         for (int i = n8 - 1; i >= 0; i--)
            fadd8_buffer(a, b, c, pA += 8, pB += 8, pC += 8);
         pA += 8;
         pB += 8;
         pC += 8;

While the fix adds only a handful of operations, performance was 95k/s with the bug, and after the fix it's 50k/s.

This is exactly what I mean: tiny changes in bytecode can have a major effect :-\
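
Incidentally, this class of bug is cheap to catch with a reference check. A quick sketch of my own (a hypothetical verify() helper, assuming the faddN_buffer method and the java.nio imports from the benchmark above are in scope):


   private static boolean verify(int n)
   {
      FloatBuffer a = ByteBuffer.allocateDirect(n * 4).order(ByteOrder.nativeOrder()).asFloatBuffer();
      FloatBuffer b = ByteBuffer.allocateDirect(n * 4).order(ByteOrder.nativeOrder()).asFloatBuffer();
      FloatBuffer c = ByteBuffer.allocateDirect(n * 4).order(ByteOrder.nativeOrder()).asFloatBuffer();

      for (int i = 0; i < n; i++)
      {
         a.put(i, i);
         b.put(i, i * 3.0f);
      }

      faddN_buffer(a, b, c, n);

      // compare against the naive element-by-element result
      for (int i = 0; i < n; i++)
         if (c.get(i) != a.get(i) + b.get(i))
            return false; // with the bug, the tail past the last full 8-block is never written

      return true;
   }

With the buggy version, verify(13) fails (the last five elements are never written); after the fix it passes.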

Implementing the same algorithm with different data structures can also impact performance pretty seriously:


      // tight loop
      for (int k = 0; k < n; k++)
         c.put(k, a.get(k) + b.get(k));

      // unrolled(4)
      for (int k = 0; k < n;)
      {
         c.put(k, a.get(k) + b.get(k));         k++;
         c.put(k, a.get(k) + b.get(k));         k++;
         c.put(k, a.get(k) + b.get(k));         k++;
         c.put(k, a.get(k) + b.get(k));         k++;
      }

      // unrolled(8)
      for (int k = 0; k < n;)
      {
         c.put(k, a.get(k) + b.get(k));         k++;
         c.put(k, a.get(k) + b.get(k));         k++;
         c.put(k, a.get(k) + b.get(k));         k++;
         c.put(k, a.get(k) + b.get(k));         k++;
         c.put(k, a.get(k) + b.get(k));         k++;
         c.put(k, a.get(k) + b.get(k));         k++;
         c.put(k, a.get(k) + b.get(k));         k++;
         c.put(k, a.get(k) + b.get(k));         k++;
      }
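
(The "array" rows below use the analogous float[] loops; only the buffer code is shown above, so this is my reconstruction of the tight variant:)


      // tight loop, float[] variant (my reconstruction -- the array code wasn't posted)
      for (int k = 0; k < n; k++)
         c[k] = a[k] + b[k];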

[b]C [ i ] = A [ i ] + B [ i ][/b]

buffer: tight loop 118k/s   unrolled(4x) 155k/s ( +31%)   unrolled(8x) 165k/s ( +40%)
array:  tight loop  89k/s   unrolled(4x) 184k/s (+106%)   unrolled(8x) 229k/s (+157%)

See the extreme difference in performance between buffer and array! (ignoring pointers for the moment)
For other algorithms (more complex, with lots of random-access reads and writes), pointer structures almost always win.

Let's change the algorithm to a cubic (O(n^3)) one:


      // Case 1
      for (int i = 0; i < n; i++)
         for (int j = 0; j < n; j++)
            for (int k = 0; k < n; k++)
               c.put(k, a.get(i) + b.get(j));

      // Case 2a
      for (int i = 0; i < n; i++)
         for (int j = 0; j < n; j++)
            for (int k = 0; k < n;)
            {
               c.put(k, a.get(i) + b.get(j)); k++;
               c.put(k, a.get(i) + b.get(j)); k++;
               c.put(k, a.get(i) + b.get(j)); k++;
               c.put(k, a.get(i) + b.get(j)); k++;
            }


      // Case 2b
      for (int i = 0; i < n; i++)
         for (int j = 0; j < n; j++)
            for (int k = 0; k < n;)
            {
               c.put(k++, a.get(i) + b.get(j));
               c.put(k++, a.get(i) + b.get(j));
               c.put(k++, a.get(i) + b.get(j));
               c.put(k++, a.get(i) + b.get(j));
            }

      // Case 2c
      for (int i = 0; i < n; i++)
         for (int j = 0; j < n; j++)
            for (int k = 0; k < n; k += 4)
            {
               c.put(k + 0, a.get(i) + b.get(j));
               c.put(k + 1, a.get(i) + b.get(j));
               c.put(k + 2, a.get(i) + b.get(j));
               c.put(k + 3, a.get(i) + b.get(j));
            }

      // Case 2d
      for (int i = 0; i < n; i++)
         for (int j = 0; j < n; j++)
            for (int k = 0; k < n; k += 4)
            {
               c.put(k, a.get(i) + b.get(j));
               c.put(k + 1, a.get(i) + b.get(j));
               c.put(k + 2, a.get(i) + b.get(j));
               c.put(k + 3, a.get(i) + b.get(j));
            }


      // The same cases again, using float[] arrays instead of FloatBuffers.

      // Case 2a
      for (int i = 0; i < n; i++)
         for (int j = 0; j < n; j++)
            for (int k = 0; k < n;)
            {
               c[k] = a[i] + b[j]; k++;
               c[k] = a[i] + b[j]; k++;
               c[k] = a[i] + b[j]; k++;
               c[k] = a[i] + b[j]; k++;
            }

      // Case 2b
      for (int i = 0; i < n; i++)
         for (int j = 0; j < n; j++)
            for (int k = 0; k < n;)
            {
               c[k++] = a[i] + b[j];
               c[k++] = a[i] + b[j];
               c[k++] = a[i] + b[j];
               c[k++] = a[i] + b[j];
            }

      // Case 2c
      for (int i = 0; i < n; i++)
         for (int j = 0; j < n; j++)
            for (int k = 0; k < n; k += 4)
            {
               c[k + 0] = a[i] + b[j];
               c[k + 1] = a[i] + b[j];
               c[k + 2] = a[i] + b[j];
               c[k + 3] = a[i] + b[j];
            }

      // Case 2d
      for (int i = 0; i < n; i++)
         for (int j = 0; j < n; j++)
            for (int k = 0; k < n; k += 4)
            {
               c[k] = a[i] + b[j];
               c[k + 1] = a[i] + b[j];
               c[k + 2] = a[i] + b[j];
               c[k + 3] = a[i] + b[j];
            }

[b]FloatBuffer[/b]
Case 1:  1801k/s
Case 2a: 2630k/s
Case 2b: 2872k/s
Case 2c: 3218k/s :-*
Case 2d: 2623k/s :'( (22% performance loss just from removing the "+ 0")

[b]Float Array[/b]
Case 1:  3464k/s
Case 2a: 4116k/s
Case 2b: 4225k/s
Case 2c: 4226k/s
Case 2d: 4228k/s :-*

[b]Float Pointer[/b]
Case 1: 3336k/s
Case 2: 3571k/s

[b]Int Buffer[/b]
Case 2a: 2593k/s
Case 2b: 2423k/s
Case 2c: 2562k/s
Case 2d: 2564k/s

[b]Int Array[/b]
Case 1:  1784k/s
Case 2a: 1887k/s
Case 2b: 2030k/s
Case 2c: 2064k/s
Case 2d: 2064k/s

[b]Int Pointer[/b]
Case 1: 3272k/s
Case 2: 3930k/s :-*

It seems somewhat curious to me that none of the "officials" have posted a comment on all the nice stuff above.
It would be very nice to hear from them and get some good indications and hints.

Cheers,

Mik

I’m not sure what you mean by Int Pointer and Float Pointer structures, could you explain? Is it what you just demonstrated in your other thread?

It's the code using sun.misc.Unsafe, not the "mapped objects" approach:

   unsafe.putInt(pntr, value);
   value = unsafe.getInt(pntr);
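
For illustration, a minimal self-contained sketch of that pointer approach (my own reconstruction, not the actual benchmark code), adding two off-heap float "arrays" into a third:


import java.lang.reflect.Field;

import sun.misc.Unsafe;

public class PointerAdd
{
   private static final Unsafe unsafe;

   static
   {
      try
      {
         // grab the singleton via reflection (sun.misc.Unsafe has no public constructor)
         Field f = Unsafe.class.getDeclaredField("theUnsafe");
         f.setAccessible(true);
         unsafe = (Unsafe) f.get(null);
      }
      catch (Exception exc)
      {
         throw new Error(exc);
      }
   }

   public static void main(String[] args)
   {
      int n = 1024;

      // three off-heap "float arrays", addressed by raw pointer
      long a = unsafe.allocateMemory(n * 4L);
      long b = unsafe.allocateMemory(n * 4L);
      long c = unsafe.allocateMemory(n * 4L);

      for (int i = 0; i < n; i++)
      {
         unsafe.putFloat(a + i * 4L, i);
         unsafe.putFloat(b + i * 4L, i * 2.0f);
      }

      // c[i] = a[i] + b[i]
      for (long pa = a, pb = b, pc = c, end = a + n * 4L; pa < end; pa += 4, pb += 4, pc += 4)
         unsafe.putFloat(pc, unsafe.getFloat(pa) + unsafe.getFloat(pb));

      System.out.println(unsafe.getFloat(c + 10 * 4L)); // element 10 -> 30.0

      unsafe.freeMemory(a);
      unsafe.freeMemory(b);
      unsafe.freeMemory(c);
   }
}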

I thought there was one too, but my quick search didn’t find it. If I recall correctly it only works in the debug builds of the JVM… you could probably get a debug version of Mustang, er… Java 6, from the project site at java.net.

BTW, have you filed a bug report? I think Sun takes performance regressions very seriously, and it appears the Java 6 client VM certainly has a problem.