Bounds checks and structs

Then let’s vote for C#-like properties, which give you nice, short syntax and a total lack of knowledge about the side effects of simple assignments.
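To make that worry concrete, here is a hypothetical Java rendering (Point, setX and writeCount are made-up names): with property syntax, an assignment that looks like a plain field write would silently call a setter like this one.

```java
// Hypothetical illustration of the complaint: with property syntax the
// innocent-looking assignment "p.x = 5" would silently call a setter
// like setX, which can have arbitrary side effects (here, a counter).
// Point, setX and writeCount are made-up names.
final class Point {
    private int x;
    private int writes;            // hidden side effect of every "assignment"

    public void setX(int v) {      // what "p.x = v" would expand to
        writes++;
        x = v;
    }

    public int getX()       { return x; }
    public int writeCount() { return writes; }
}
```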

Anyway, even for that there is a ‘soft’ solution - bytecode translation at runtime (which is currently nicely exposed in the core Java API). Yes, it is a bit of a hack, but the benefit is that it does not extend the language in any way for people who don’t want it. For you, it would just mean one call at the start of the application to apply the bytecode filter, plus importing one library (this library being very dependent on the JDK version unfortunately, not to mention the vendor).
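A sketch of what that hook could look like with the java.lang.instrument API (MappedObjectAgent and MappedObjectTransformer are hypothetical names, not a real library; the actual class rewriting is omitted):

```java
import java.lang.instrument.ClassFileTransformer;
import java.lang.instrument.Instrumentation;
import java.security.ProtectionDomain;

// Hypothetical skeleton of the "bytecode filter" described above.
// MappedObjectAgent / MappedObjectTransformer are made-up names; the
// agent would be hooked up once, via -javaagent, at application start.
final class MappedObjectAgent {

    public static void premain(String args, Instrumentation inst) {
        inst.addTransformer(new MappedObjectTransformer());
    }

    static final class MappedObjectTransformer implements ClassFileTransformer {
        @Override
        public byte[] transform(ClassLoader loader, String className,
                                Class<?> classBeingRedefined,
                                ProtectionDomain pd, byte[] classfile) {
            // A real weaver would check whether the class extends
            // MappedStructure and, if so, use a bytecode library (ASM,
            // BCEL, ...) to replace field accesses with buffer reads and
            // writes. Returning null keeps the original bytecode.
            return null;
        }
    }
}
```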

So, the example below would look something like


public final class XYWithSomeGaps extends MappedStructure {
  @Offset(8) public int x;
  @Offset(16) public int y;
}

MappedStructure would expose only an attach method. You would directly access the x, y fields on XYWithSomeGaps. A runtime bytecode weaver would convert XYWithSomeGaps to contain getters/setters and all the magic needed for conversions (if the fields are not defined to be in native endianness, for example), and would also convert every access into getX/getY/etc. HotSpot would then convert these accessors back into direct memory references :)
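Hand-written, the weaver’s output for XYWithSomeGaps would be roughly this (MappedStructure and attach come from the post; the method bodies and the name XYWithSomeGapsWoven are assumptions):

```java
import java.nio.ByteBuffer;

// A hand-written approximation of what the weaver would generate for
// XYWithSomeGaps. MappedStructure and attach come from the post; the
// method bodies and the name XYWithSomeGapsWoven are assumptions.
class MappedStructure {
    protected ByteBuffer buf;

    public final void attach(ByteBuffer buf) {
        this.buf = buf;
    }
}

final class XYWithSomeGapsWoven extends MappedStructure {
    // @Offset(8) public int x;  becomes:
    public int getX()       { return buf.getInt(8); }
    public void setX(int v) { buf.putInt(8, v); }

    // @Offset(16) public int y; becomes:
    public int getY()       { return buf.getInt(16); }
    public void setY(int v) { buf.putInt(16, v); }
}
```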

Well, great idea.

Let’s make Java into what C++ was - a really fat and bloated language, hard to learn and really hard to build compilers for.
Java really is getting fat - boxing is nice and generics are too (from my point of view it was a mistake to include them this late), but isn’t that enough?
What about direct, unchecked memory access? It would surely boost performance, but at what price?

hmm, regards Clemens

Can the same performance be achieved without the Sun specific unsafe hacks?

Mapped Objects wouldn’t be adding bloat, really. And they’re not unsafe at all.

Mapped Objects wouldn’t be allocated on the heap, but in a native (properly ordered) ByteBuffer.

That’s all there is to it.
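A minimal sketch of that backing storage, assuming plain NIO calls (MappedStorage is an illustrative name):

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Minimal sketch: the backing store is off-heap (allocateDirect, so not
// on the Java heap) and in the platform's native byte order, so HotSpot
// can turn get/put into plain loads and stores.
final class MappedStorage {
    static ByteBuffer create(int bytes) {
        return ByteBuffer.allocateDirect(bytes).order(ByteOrder.nativeOrder());
    }

    public static void main(String[] args) {
        ByteBuffer storage = create(64);
        storage.putInt(0, 42);      // an int "field" at offset 0
        storage.putFloat(4, 1.5f);  // a float "field" at offset 4
        System.out.println(storage.getInt(0) + " " + storage.getFloat(4));
    }
}
```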

Nothing we can come up with will be uglier, more complicated for compilers, or less clear than generics as they currently are.

If you need it, you need it. You need it to the point of scrapping all the Java just because of the lack of it.

I partially agree with you - it is probably too specific to be incorporated into mainstream Java. I think that some combination of annotations, Unsafe hackery and a bytecode weaver is just enough to get everything Cas wants.

I have tried a similar class built on a ByteBuffer (just replacing every get/set method with an access to the buffer)… Enough to say that the inner loop had around 100 CPU instructions instead of 6, including 4-5 branches instead of one at the end. That was after making the byte buffer native-ordered; before that it was even worse (HotSpot was actually calling some methods inside the loop).

Maybe there is some trick to improve this - I cannot find it at the moment.

Have you tried an IntBuffer view on the ByteBuffer? Apparently this is necessary to get any optimisation.
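For reference, the view-buffer setup looks like this (a minimal sketch; ViewDemo is an illustrative name). The IntBuffer shares the ByteBuffer’s storage, and absolute get/put on the view is the form that gets optimised:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.IntBuffer;

// The IntBuffer view shares the ByteBuffer's storage; its indices are
// in ints rather than bytes, so view index 2 aliases bytes 8..11.
final class ViewDemo {
    public static void main(String[] args) {
        ByteBuffer bb = ByteBuffer.allocateDirect(16).order(ByteOrder.nativeOrder());
        IntBuffer ib = bb.asIntBuffer();

        ib.put(2, 7);                      // writes bytes 8..11 of bb
        System.out.println(bb.getInt(8));  // prints 7 - same memory
    }
}
```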

I’ve benchmarked these two classes:


   private static class ClazzA
   {
      public int a, b, c;
      public float x, y, z;
   }

   private static class ClazzB
   {
      private final static int ALIGN = 4;
      private final static int SIZEOF = (3 + 3) * ALIGN;
      private final static ByteBuffer bb = BufferUtils.createByteBuffer(SIZEOF * 64);
      private final static IntBuffer ib = bb.asIntBuffer();
      private final static FloatBuffer fb = bb.asFloatBuffer();
      private static int nextOffset = 0;

      private final int off;

      public ClazzB()
      {
         off = nextOffset;

         nextOffset += SIZEOF / ALIGN;
      }

      public final void a(int a) { ib.put(off+0, a); }
      public final void b(int b) { ib.put(off+1, b); }
      public final void c(int c) { ib.put(off+2, c); }
      public final void x(float x) { fb.put(off+3, x); }
      public final void y(float y) { fb.put(off+4, y); }
      public final void z(float z) { fb.put(off+5, z); }
      public final int a() { return ib.get(off+0); }
      public final int b() { return ib.get(off+1); }
      public final int c() { return ib.get(off+2); }
      public final float x() { return fb.get(off+3); }
      public final float y() { return fb.get(off+4); }
      public final float z() { return fb.get(off+5); }
   }

The benchmark did this calculation on the instances:


                  a.c = a.a + a.b;
                  a.b = a.c + a.a;
                  a.a = a.b - a.c;

...

                  b.c(b.a() + b.b());
                  b.b(b.c() + b.a());
                  b.a(b.b() - b.c());

Results:

Total Math on fields: 430ms
Total Math on buffer: 711ms
Performance difference factor: 1.6534884

Total Math on fields: 381ms
Total Math on buffer: 315ms
Performance difference factor: 0.8267717

Total Math on fields: 268ms
Total Math on buffer: 328ms
Performance difference factor: 1.2238806

Total Math on fields: 245ms
Total Math on buffer: 329ms
Performance difference factor: 1.3428571

Total Math on fields: 248ms
Total Math on buffer: 319ms
Performance difference factor: 1.2862903

Total Math on fields: 251ms
Total Math on buffer: 334ms
Performance difference factor: 1.3306773

Total Math on fields: 255ms
Total Math on buffer: 335ms
Performance difference factor: 1.3137255

Total Math on fields: 254ms
Total Math on buffer: 328ms
Performance difference factor: 1.2913386

Isn’t this enough performance?

33% performance loss, without changing the VM, without using Unsafe.
Only 22% performance loss when running with -Xcomp.

;D

Note that this is done with the Sun server VM 1.5.0_05

Benchmark code:

it does:

for (8x)
{
   long aTook = 0;
   long bTook = 0;

   for (4x)
   {
      // init values

      for (1M x)
         run 4 blocks of math on ClazzA and measure the result
      aTook += tDiff;

      for (1M x)
         run 4 blocks of math on ClazzB and measure the result
      bTook += tDiff;

      // compare values
   }

   // print aTook & bTook
}

         int classes = 64;
         ClazzA[] cA = new ClazzA[classes];
         ClazzB[] cB = new ClazzB[classes];

         for (int i = 0; i < classes; i++)
         {
            cA[i] = new ClazzA();
            cB[i] = new ClazzB();
         }

         //

         int loop = 1024 * 1024;

         for (int p = 0; p < 8; p++)
         {
            long tA = 0;
            long tB = 0;

            for (int k = 0; k < 4; k++)
            {
               // init
               for (int i = 0; i < classes; i++)
               {
                  ClazzA a = cA[i % classes];
                  ClazzB b = cB[i % classes];

                  a.a = 5; a.b = 4; a.c = 3; a.x = 1.0F; a.y = 1.1F; a.z = 1.2F;
                  b.a(5); b.b(4); b.c(3); b.x(1.0F); b.y(1.1F); b.z(1.2F);
               }

               // calc
               long t0A = System.nanoTime();
               for (int i = 0; i < loop; i++)
               {
                  ClazzA a = cA[i % classes];

                  a.c = a.a + a.b;                  a.b = a.c + a.a;                  a.a = a.b - a.c;
                  a.c = a.a + a.b;                  a.b = a.c + a.a;                  a.a = a.b - a.c;
                  a.c = a.a + a.b;                  a.b = a.c + a.a;                  a.a = a.b - a.c;
                  a.c = a.a + a.b;                  a.b = a.c + a.a;                  a.a = a.b - a.c;

                  a.z = a.x + a.y;                  a.y = a.z + a.x;                  a.x = a.y - a.z;
                  a.z = a.x + a.y;                  a.y = a.z + a.x;                  a.x = a.y - a.z;
                  a.z = a.x + a.y;                  a.y = a.z + a.x;                  a.x = a.y - a.z;
                  a.z = a.x + a.y;                  a.y = a.z + a.x;                  a.x = a.y - a.z;
               }
               long t1A = System.nanoTime();

               long t0B = System.nanoTime();
               for (int i = 0; i < loop; i++)
               {
                  ClazzB b = cB[i % classes];

                  b.c(b.a() + b.b());                  b.b(b.c() + b.a());                  b.a(b.b() - b.c());
                  b.c(b.a() + b.b());                  b.b(b.c() + b.a());                  b.a(b.b() - b.c());
                  b.c(b.a() + b.b());                  b.b(b.c() + b.a());                  b.a(b.b() - b.c());
                  b.c(b.a() + b.b());                  b.b(b.c() + b.a());                  b.a(b.b() - b.c());

                  b.z(b.x() + b.y());                  b.y(b.z() + b.x());                  b.x(b.y() - b.z());
                  b.z(b.x() + b.y());                  b.y(b.z() + b.x());                  b.x(b.y() - b.z());
                  b.z(b.x() + b.y());                  b.y(b.z() + b.x());                  b.x(b.y() - b.z());
                  b.z(b.x() + b.y());                  b.y(b.z() + b.x());                  b.x(b.y() - b.z());
               }
               long t1B = System.nanoTime();

               // validate
               if (false)
               {
                  for (int i = 0; i < classes; i++)
                  {
                     ClazzA a = cA[i % classes];
                     ClazzB b = cB[i % classes];

                     boolean errorI = (a.a != b.a()) || (a.b != b.b()) || (a.c != b.c());
                     boolean errorF = (a.x != b.x()) || (a.y != b.y()) || (a.z != b.z());

                     if (errorI || errorF)
                        throw new IllegalStateException("Error @ " + i);
                  }
               }

               tA += ((t1A - t0A) / 1000000);
               tB += ((t1B - t0B) / 1000000);
            }

            System.out.println();
            System.out.println("Total Math on fields: " + tA + "ms");
            System.out.println("Total Math on buffer: " + tB + "ms");
            System.out.println("Performance difference factor: " + (float) tB / tA);
         }

Riven, what changed between your current post and the rather slower performance shown on the initial post (before editing)?

Hehe, I’ve got the feeling I’m being watched :)

Okay, here we go:

I had it like this:

private final int a0, a1, a2, a3, a4, a5;

ClazzB()
{
   a0 = nextOffset + 0;
   a1 = nextOffset + 1;
   a2 = nextOffset + 2;
  etc...
}

buffer.get(a0);
buffer.get(a1);
buffer.get(a2);
etc...

Now it’s:

private final int off;

ClazzB()
{
   off = nextOffset;
}

buffer.get(off+0);
buffer.get(off+1);
buffer.get(off+2);
etc...

So the second option was much faster: the factor dropped from 2.7 to 1.3.

It also saves RAM, so there are no tradeoffs…

That is bizarre. An apparently trivial change makes the performance change by a factor of 2. It doesn’t give me confidence that the code would work as expected on another JVM (e.g. one by IBM or Apple).

Write once, profile everywhere

After rearranging the code a bit (putting the loops inside their own methods, so it is easier to see what is happening) and simplifying the loops (HotSpot was doing everything in registers after the first few memory accesses, plus it was again too much code to check), I have come up with the following


  private static void testB(int classes) {
    for(int i = 0; i < loop; i++) {
      ClazzB b = cB[i % classes];
      b.c(b.a() + b.b());
      b.z(b.x() + b.y());  
    }
  }

  private static void testA(int classes) {
    for(int i = 0; i < loop; i++) {
      ClazzA a = cA[i % classes];
      a.c = a.a + a.b;
      a.z = a.x + a.y;
    }
  }

and the generated code for both methods (inner loop only, the rest is not important) is

testA


080   B6: #	B10 B7 <- B5 B8 	Loop: B6-B8 inner stride: not constant  Freq: 7.51066
080   	MOV    EAX,ESI
082   	MOV    ECX,[ESP + #4]
086   	CDQ
	IDIV   ECX
0a0   	CMPu   EDX,EBX
0a2   	Jge,us B10  P=0.000001 C=-1.000000
0a2
0a4   B7: #	B11 B8 <- B6  Freq: 7.51065
0a4   	MOV    EBP,[EDI + #12 + EDX << #2]
0a8   	MOV    EDX,[EBP + #8]
0ab   	NullCheck EBP
0ab
0ab   B8: #	B6 B9 <- B7  Freq: 7.51065
0ab   	MOVSS  XMM0a,[EBP + #24]
0b0   	MOVSS  XMM2a,[EBP + #20]
0b5   	MOV    ECX,[EBP + #12]
0b8   	ADDSS  XMM2a,XMM0a
0bc   	MOVSS  [EBP + #28],XMM2a
0c1   	ADD    EDX,ECX
0c3   	MOV    [EBP + #16],EDX
0c6   	INC    ESI
0c7   	CMP    ESI,#1048576
0cd   	Jlt,s  B6  P=1.000000 C=7.509333
0cd

and testB



1b0   B18: #	B35 B19 <- B17 B32 	Loop: B18-B32 inner stride: not constant  Freq: 14.1767
1b0   	MOV    EAX,EBX
1b2   	MOV    ECX,[ESP + #28]
1b6   	CDQ
	IDIV   ECX
1d0   	CMPu   EDX,[ESP + #36]
1d4   	Jge,u  B35  P=0.000001 C=-1.000000
1d4
1da   B19: #	B34 B20 <- B18  Freq: 14.1766
1da   	MOV    EDI,[ESP + #32]
1de   	MOV    ECX,[EDI + #12 + EDX << #2]
1e2   	MOV    [ESP + #8],ECX
1e6   	MOV    ECX,[ECX + #8]
1e9   	NullCheck ECX
1e9
1e9   B20: #	B64 B21 <- B19  Freq: 14.1766
1e9   	TEST   ECX,ECX
1eb   	Jlt    B64  P=0.000000 C=6.667333
1eb
1f1   B21: #	B62 B22 <- B20  Freq: 6.66733
1f1   	CMP    ECX,[ESP + #60]
1f5   	Jge    B62  P=0.000000 C=6.667333
1f5
1fb   B22: #	B59 B23 <- B21  Freq: 6.66733
1fb   	MOV    ESI,ECX
1fd   	INC    ESI
1fe   	MOV    EDI,ECX
200   	SHL    EDI,#2
203   	MOV    EBP,[ESP + #4]
207   	ADD    EBP,EDI
209   	MOV    EDX,[EBP]
20c   	TEST   ESI,ESI
20e   	Jlt    B59  P=0.000000 C=6.667333
20e
214   B23: #	B57 B24 <- B22  Freq: 6.66733
214   	CMP    ESI,[ESP + #60]
218   	Jge    B57  P=0.000000 C=6.667333
218
21e   B24: #	B54 B25 <- B23  Freq: 6.66733
21e   	MOV    EAX,[EBP + #4]
221   	ADD    EAX,EDX
223   	MOV    EDX,ECX
225   	ADD    EDX,#2
228   	TEST   EDX,EDX
22a   	Jlt    B54  P=0.000000 C=6.667333
22a
230   B25: #	B52 B26 <- B24  Freq: 6.66733
230   	CMP    EDX,[ESP + #60]
234   	Jge    B52  P=0.000000 C=6.667333
234
23a   B26: #	B49 B27 <- B25  Freq: 6.66733
23a   	MOV    [EBP + #8],EAX
23d   	MOV    EAX,ECX
23f   	ADD    EAX,#3
242   	TEST   EAX,EAX
244   	Jlt    B49  P=0.000000 C=6.667333
244
24a   B27: #	B47 B28 <- B26  Freq: 6.66733
24a   	CMP    EAX,[ESP + #56]
24e   	Jge    B47  P=0.000000 C=6.667333
24e
254   B28: #	B44 B29 <- B27  Freq: 6.66733
254   	MOV    EBP,[ESP + #0]
257   	ADD    EBP,EDI
259   	MOVSS  XMM0a,[EBP + #12]
25e   	MOV    EAX,ECX
260   	ADD    EAX,#4
263   	TEST   EAX,EAX
265   	Jlt    B44  P=0.000000 C=6.667333
265
26b   B29: #	B42 B30 <- B28  Freq: 6.66733
26b   	CMP    EAX,[ESP + #56]
26f   	Jge    B42  P=0.000000 C=6.667333
26f
275   B30: #	B39 B31 <- B29  Freq: 6.66733
275   	MOVSS  XMM2a,[EBP + #16]
27a   	ADDSS  XMM2a,XMM0a
27e   	ADD    ECX,#5
281   	TEST   ECX,ECX
283   	Jlt,s  B39  P=0.000000 C=6.667333
283
285   B31: #	B37 B32 <- B30  Freq: 6.66733
285   	CMP    ECX,[ESP + #56]
289   	Jge,s  B37  P=0.000000 C=6.667333
289
28b   B32: #	B18 B33 <- B31  Freq: 6.66733
28b   	MOVSS  [EBP + #20],XMM2a
290   	INC    EBX
291   	CMP    EBX,#1048576
297   	Jlt    B18	# Loop end  P=1.000000 C=7.509333

I’m quite surprised that the speed difference is only about 1.4 (on my AMD machine with the short loop).

Ok, I think I have found the problem. Replace the loops with


  private static void testB(int classes) {
    for(int i = 0; i < loop; i++) {
      ClazzB b = cB[i&0x2f];
      b.c(b.a() + b.b());
      b.z(b.x() + b.y());  
    }
  }

  private static void testA(int classes) {
    for(int i = 0; i < loop; i++) {
      ClazzA a = cA[i&0x2f];
      a.c = a.a + a.b;
      a.z = a.x + a.y;
    }
  }

We were basically measuring the speed of CDQ / IDIV ECX (the i % classes modulo), not the field access.

Now, for the short loop, the difference is 3.64 times.

Edit:
Still, there was something wrong. After changing the test methods to


private static void testB(int classes) {
    
    for(int i = 0; i < loop; i++) {
      ClazzB b = cB[0];
      b.c(b.a() + b.b());
      b.z(b.x() + b.y());  
      b = cB[1];
      b.c(b.a() + b.b());
      b.z(b.x() + b.y());  
      b = cB[2];
      b.c(b.a() + b.b());
      b.z(b.x() + b.y());  
      b = cB[3];
      b.c(b.a() + b.b());
      b.z(b.x() + b.y());  
    }
  }

  private static void testA(int classes) {

    for(int i = 0; i < loop; i++) {
      ClazzA a = cA[0];
      a.c = a.a + a.b;
      a.z = a.x + a.y;
      a = cA[1];
      a.c = a.a + a.b;
      a.z = a.x + a.y;
      a = cA[2];
      a.c = a.a + a.b;
      a.z = a.x + a.y;
      a = cA[3];
      a.c = a.a + a.b;
      a.z = a.x + a.y;
    }
  }

the difference is 11-13 times, depending on the JVM (5.0 or 6.0).

[quote=“abies,post:34,topic:24918”]
After I simplified the loops, I get performance factor 1.14
[/quote]

Note that you do “& 0x2F”, which is 47, while there are 64 elements in the array, so it must be “& 0x3F”.
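The arithmetic behind that correction: for non-negative i and a power-of-two length n, i & (n - 1) equals i % n, so 64 elements need the mask 0x3F (63). A quick check (MaskCheck is an illustrative name):

```java
// For non-negative i and a power-of-two length n, (i & (n - 1)) == (i % n),
// so 64 elements need the mask 0x3F (= 63). 0x2F is 47: it clears bit 4,
// so indices 16..31 and 48..63 are never produced.
final class MaskCheck {
    public static void main(String[] args) {
        for (int i = 0; i < 1000; i++) {
            if ((i & 63) != (i % 64)) throw new AssertionError("i=" + i);
        }
        System.out.println(50 & 0x2F); // prints 34 - the wrong element
        System.out.println(50 & 0x3F); // prints 50
    }
}
```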

I already PMed you, but you haven’t replied yet, so could you please tell me how to get that native code printed? Thanks.

On my system:

               for (int i = 0; i < loop; i++)
               {
                  ClazzB b = cB[i % classes];

                  b.c(b.a() + b.b());
                  b.b(b.c() + b.a());
                  b.a(b.b() - b.c());

                  b.z(b.x() + b.y());
                  b.y(b.z() + b.x());
                  b.x(b.y() - b.z());
               }

Factor = 1.14


               for (int i = 0; i < loop; i++)
               {
                  ClazzB b = cB[i & 0x3F];

                  b.c(b.a() + b.b());
                  b.b(b.c() + b.a());
                  b.a(b.b() - b.c());

                  b.z(b.x() + b.y());
                  b.y(b.z() + b.x());
                  b.x(b.y() - b.z());
               }

Factor = 1.70

So it’s :
ClazzB b = cB[i & 0x3F]; —> 1.70
ClazzB b = cB[i % classes]; —> 1.14

It seems we’re benchmarking the wrong stuff here (the loop-management code)

I’ll rewrite the benchmark!


         int classes = 64;
         
         ClazzA[] cA = new ClazzA[classes];
         ClazzB[] cB = new ClazzB[classes];

         for (int i = 0; i < classes; i++)
         {
            cA[i] = new ClazzA();
            cB[i] = new ClazzB();
         }

         //

         int loop = 1024 * 128;

         for (int p = 0; p < 8; p++)
         {
            long tA = 0;
            long tB = 0;

            for (int k = 0; k < 4; k++)
            {
               // calc
               long t0A = System.nanoTime();
               for (int i = 0; i < loop; i++)
               {
                  ClazzA a = cA[i & 0x3F];
                  for (int j = 0; j < 64; j++)
                     comp(a);
               }
               long t1A = System.nanoTime();

               long t0B = System.nanoTime();
               for (int i = 0; i < loop; i++)
               {
                  ClazzB b = cB[i & 0x3F];
                  for (int j = 0; j < 64; j++)
                     comp(b);
               }
               long t1B = System.nanoTime();

               tA += ((t1A - t0A) / 1000000);
               tB += ((t1B - t0B) / 1000000);
            }

            System.out.println("Performance factor: " + (float) tB / tA + " (Time: " + (tA + tB) + "ms)");
         }



   private static final void comp(ClazzA a)
   {
      a.c = a.a + a.b;
      a.b = a.c + a.a;
      a.a = a.b - a.c;

      a.z = a.x + a.y;
      a.y = a.z + a.x;
      a.x = a.y - a.z;
   }



   private static final void comp(ClazzB b)
   {
      b.c(b.a() + b.b());
      b.b(b.c() + b.a());
      b.a(b.b() - b.c());

      b.z(b.x() + b.y());
      b.y(b.z() + b.x());
      b.x(b.y() - b.z());
   }
Performance factor: 1.3216169 (Time: 1321ms)
Performance factor: 1.2138836 (Time: 1180ms)
Performance factor: 1.5396290 (Time: 1506ms)
Performance factor: 1.2973485 (Time: 1213ms)
Performance factor: 1.3528184 (Time: 1127ms)
Performance factor: 1.3636364 (Time: 1118ms)
Performance factor: 1.3215767 (Time: 1119ms)
Performance factor: 1.3859276 (Time: 1119ms)

The perils of micro benchmarks!

Yup, that’s why I posted the benchmark, so that others could point out the weaknesses. Very useful :)