Tiny object performance overhead

I read in an article* by a JVM engineer that creating new objects was ‘almost at the cost of shifting a pointer’.

* I tried hard to find the article, but sometimes java.sun.com is kinda hard to wade through

Further, the GC is considered so intelligent and efficient that its effect should be ‘noise’, even in performance-critical code.

Combining these two would almost make you think allocating and discarding tiny objects is nearly free, or at least has only a small impact.

I decided to put it to the test, in a real-world application which has its bottleneck in a sphere<->triangle method.
Basic vector math (Vec3) was implemented like:

public static final Vec3 add(Vec3 a, Vec3 b) {
    return new Vec3(a.x + b.x, a.y + b.y, a.z + b.z);
}

When I was writing this code it seemed horribly inefficient.
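For reference, here’s a minimal sketch of what the rest of the Vec3 class would look like - the fields and static helpers are inferred from the calls in the snippets below, so the actual class may differ in details:

public class Vec3
{
   public float x, y, z;

   public Vec3()
   {
   }

   public Vec3(float x, float y, float z)
   {
      this.x = x;
      this.y = y;
      this.z = z;
   }

   // overwrite this instance in place (used by the 'Used Object Loop' further down)
   public void load(float x, float y, float z)
   {
      this.x = x;
      this.y = y;
      this.z = z;
   }

   public static final Vec3 sub(Vec3 a, Vec3 b)
   {
      return new Vec3(a.x - b.x, a.y - b.y, a.z - b.z);
   }

   public static final float dot(Vec3 a, Vec3 b)
   {
      return a.x * b.x + a.y * b.y + a.z * b.z;
   }

   public static final Vec3 cross(Vec3 a, Vec3 b)
   {
      return new Vec3(a.y * b.z - a.z * b.y,
                      a.z * b.x - a.x * b.z,
                      a.x * b.y - a.y * b.x);
   }

   public static final Vec3 mul(float s, Vec3 a)
   {
      return new Vec3(s * a.x, s * a.y, s * a.z);
   }
}

Note that every sub, cross, mul and add allocates a fresh Vec3 - that’s exactly what this test is about.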

The next snippet shows the algorithm - it computes the point on triangle abc closest to the point p:

      Vec3 ba = sub(b, a);
      Vec3 ca = sub(c, a);
      Vec3 pa = sub(p, a);
      float snom = dot(pa, ba);
      float tnom = dot(pa, ca);
      if (snom <= 0.0f && tnom <= 0.0f)
         return a;

      Vec3 cb = sub(c, b);
      Vec3 pb = sub(p, b);
      float unom = dot(pb, cb);
      float sdenom = dot(pb, sub(a, b));
      if (sdenom <= 0.0f && unom <= 0.0f)
         return b;

      Vec3 pc = sub(p, c);
      float tdenom = dot(pc, sub(a, c));
      float udenom = dot(pc, sub(b, c));
      if (tdenom <= 0.0f && udenom <= 0.0f)
         return c;

      Vec3 n = cross(ba, ca);

      Vec3 ap = sub(a, p);
      Vec3 bp = sub(b, p);
      float vc = dot(n, cross(ap, bp));
      if (vc <= 0.0f && snom >= 0.0f && sdenom >= 0.0f)
         return add(a, mul(snom / (snom + sdenom), ba));

      Vec3 cp = sub(c, p);
      float va = dot(n, cross(bp, cp));
      if (va <= 0.0f && unom >= 0.0f && udenom >= 0.0f)
         return add(b, mul(unom / (unom + udenom), cb));

      float vb = dot(n, cross(cp, ap));
      if (vb <= 0.0f && tnom >= 0.0f && tdenom >= 0.0f)
         return add(a, mul(tnom / (tnom + tdenom), ca));

      float u = va / (va + vb + vc);
      float v = vb / (va + vb + vc);
      float w = 1.0f - u - v;

      return add(add(mul(u, a), mul(v, b)), mul(w, c));

The following is the same algorithm with all Vec3 methods inlined by hand:

      float bax = b.x - a.x;
      float bay = b.y - a.y;
      float baz = b.z - a.z;

      float cax = c.x - a.x;
      float cay = c.y - a.y;
      float caz = c.z - a.z;

      float pax = p.x - a.x;
      float pay = p.y - a.y;
      float paz = p.z - a.z;

      float snom = pax * bax + pay * bay + paz * baz;
      float tnom = pax * cax + pay * cay + paz * caz;
      if (snom <= 0.0f && tnom <= 0.0f)
         return a;

      float abx = a.x - b.x;
      float aby = a.y - b.y;
      float abz = a.z - b.z;

      float cbx = c.x - b.x;
      float cby = c.y - b.y;
      float cbz = c.z - b.z;

      float pbx = p.x - b.x;
      float pby = p.y - b.y;
      float pbz = p.z - b.z;

      float unom = pbx * cbx + pby * cby + pbz * cbz;
      float sdenom = pbx * abx + pby * aby + pbz * abz;
      if (sdenom <= 0.0f && unom <= 0.0f)
         return b;

      float pcx = p.x - c.x;
      float pcy = p.y - c.y;
      float pcz = p.z - c.z;

      float acx = a.x - c.x;
      float acy = a.y - c.y;
      float acz = a.z - c.z;

      float bcx = b.x - c.x;
      float bcy = b.y - c.y;
      float bcz = b.z - c.z;

      float tdenom = pcx * acx + pcy * acy + pcz * acz;
      float udenom = pcx * bcx + pcy * bcy + pcz * bcz;
      if (tdenom <= 0.0f && udenom <= 0.0f)
         return c;

      float nx = bay * caz - baz * cay;
      float ny = baz * cax - bax * caz;
      float nz = bax * cay - bay * cax;

      float apx = a.x - p.x;
      float apy = a.y - p.y;
      float apz = a.z - p.z;

      float bpx = b.x - p.x;
      float bpy = b.y - p.y;
      float bpz = b.z - p.z;

      float APBPx = apy * bpz - apz * bpy;
      float APBPy = apz * bpx - apx * bpz;
      float APBPz = apx * bpy - apy * bpx;

      float vc = nx * APBPx + ny * APBPy + nz * APBPz;
      if (vc <= 0.0f && snom >= 0.0f && sdenom >= 0.0f)
      {
         Vec3 r = new Vec3();
         float t = snom / (snom + sdenom);
         r.x = bax * t + a.x;
         r.y = bay * t + a.y;
         r.z = baz * t + a.z;
         return r;
      }

      float cpx = c.x - p.x;
      float cpy = c.y - p.y;
      float cpz = c.z - p.z;

      float BPCPx = bpy * cpz - bpz * cpy;
      float BPCPy = bpz * cpx - bpx * cpz;
      float BPCPz = bpx * cpy - bpy * cpx;

      float va = nx * BPCPx + ny * BPCPy + nz * BPCPz;
      if (va <= 0.0f && unom >= 0.0f && udenom >= 0.0f)
      {
         Vec3 r = new Vec3();
         float t = unom / (unom + udenom);
         r.x = cbx * t + b.x;
         r.y = cby * t + b.y;
         r.z = cbz * t + b.z;
         return r;
      }

      float CPAPx = cpy * apz - cpz * apy;
      float CPAPy = cpz * apx - cpx * apz;
      float CPAPz = cpx * apy - cpy * apx;

      float vb = nx * CPAPx + ny * CPAPy + nz * CPAPz;
      if (vb <= 0.0f && tnom >= 0.0f && tdenom >= 0.0f)
      {
         Vec3 r = new Vec3();
         float t = (tnom / (tnom + tdenom));
         r.x = cax * t + a.x;
         r.y = cay * t + a.y;
         r.z = caz * t + a.z;
         return r;
      }

      float u = va / (va + vb + vc);
      float v = vb / (va + vb + vc);
      float w = 1.0f - u - v;

      Vec3 r = new Vec3();
      r.x = u * a.x + v * b.x + w * c.x;
      r.y = u * a.y + v * b.y + w * c.y;
      r.z = u * a.z + v * b.z + w * c.z;
      return r;

After warming both loops for several seconds, allowing the JVM to inline and optimize, these are the results:

[tr][td][/td][td]Run 1[/td][td]Run 2[/td][td]Run 3[/td][/tr]
[tr][td]Objects:[/td][td]1548ms[/td][td]1553ms[/td][td]1551ms[/td][/tr]
[tr][td]Inlined:[/td][td]505ms[/td][td]500ms[/td][td]558ms[/td][/tr]

Timing-wise, this is clearly not ‘noise’ anymore.

Some of you guys (to be honest, including me) would say: doh! - but I had kinda started to believe they really had reduced the overhead of objects. Sadly that doesn’t seem to be the case just yet.

I found Jeff’s remarks on this topic:

http://wiki.java.net/bin/view/Games/JeffOnPerformance#Do_I_need_to_avoid_garbage_colle

[quote]This means you are free today to create objects just to pass in and out of method calls
or hold temporary values, a practice which makes your code a whole lot neater, less buggy,
and simpler to maintain.
[/quote]
I’ll continue my search for the article about the pointer-shift…
I found a quote from it on another website:

[quote]Garbage Collection

The garbage collector has been greatly improved: creating a new object is now an incredibly
cheap operation, in most cases equivalent to shifting a pointer in memory. Don’t necessarily be
afraid of creating many short-lived objects, they will be garbage-collected very efficiently.
[/quote]

Are you sure the methods are inlined? Otherwise you’d have method-call overhead in the object test.

You know, I had the exact same impression that creating small objects was free; just today I was trying to decide whether to pass nine floats as three objects or directly as floats.

public void someMethod(float vec1x, float vec1y, float vec1z,
                       float vec2x, float vec2y, float vec2z,
                       float vec3x, float vec3y, float vec3z) {
}

or wrap the values in Vector3f objects

public void someMethod(Vector3f a, Vector3f b, Vector3f c) {
}

Clearly the second version is much nicer and cleaner, but it requires creating three more objects (of Vector3f). So according to your test, the first version would be the better performer?

Yup, running with -Xprof shows no sign of these methods anymore; they are interpreted a few times, then disappear (0.4% of the ticks).
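(For reference, the flat profiler is switched on with a VM flag on the command line; the class name here is just a placeholder. The profile is printed to stdout when the VM exits.)

   java -Xprof VecBench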

Very interesting stats.

Try Java 6; apparently ‘small object creation’ has become much more efficient. See:

http://www.javalobby.org/java/forums/t66270.html

PS: I’m sure you know, but to avoid warm-up loops, try the VM with the -server option (however, on Windows it is only included with the JDK, not the JRE).

New Object Loop

            int p = values.length - 1;
            while (p > 12)
            {
               Vec3 a = new Vec3(values[p--], values[p--], values[p--]);
               Vec3 b = new Vec3(values[p--], values[p--], values[p--]);
               Vec3 c = new Vec3(values[p--], values[p--], values[p--]);
               r += fancyCalc(a, b, c);
            }

Used Object Loop

            int p = values.length - 1;
            while (p > 12)
            {
              a.load(values[p--], values[p--], values[p--]);
              b.load(values[p--], values[p--], values[p--]);
              c.load(values[p--], values[p--], values[p--]);
              r += fancyCalc(a, b, c);
            }

Many Floats Loop

            int p = values.length - 1;
            while (p > 12)
            {
               r += fancyCalc(values[p--], values[p--], values[p--],
                              values[p--], values[p--], values[p--],
                              values[p--], values[p--], values[p--]);
            }

update: Float Array Loop

            int p = values.length - 1;
            while (p > 12)
            {
               r += fancyCalc(values, p -= 9);
            }
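For completeness, the loops above were timed with a harness roughly like this sketch (the runLoop method and the iteration counts are hypothetical); each loop first ran unmeasured for several seconds so HotSpot could compile and inline everything:

            static void bench(float[] values)
            {
               float r = 0.0f;

               // warm-up: run unmeasured until HotSpot has compiled and inlined the loop
               long warmEnd = System.currentTimeMillis() + 5000;
               while (System.currentTimeMillis() < warmEnd)
                  r += runLoop(values); // one of the four loops above, wrapped in a method

               // measured run
               long t0 = System.currentTimeMillis();
               for (int i = 0; i < 64; i++)
                  r += runLoop(values);
               System.out.println((System.currentTimeMillis() - t0) + "ms");

               // print the accumulated result so the JIT can't discard the work as dead code
               System.out.println("checksum: " + r);
            }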

[tr]
[td][/td]
[td]Client VM 1.4[/td]
[td][b]Server VM 1.4[/b][/td]
[td]-------[/td]
[td]Client VM 1.5[/td]
[td][b]Server VM 1.5[/b][/td]
[td]-------[/td]
[td]Client VM 1.6[/td]
[td][b]Server VM 1.6[/b][/td]
[/tr]

[tr]
[td]New Object Loop[/td]
[td]2266ms[/td]
[td]1453ms[/td]
[td][/td]
[td]2188ms[/td]
[td]1354ms[/td]
[td][/td]
[td]1427ms[/td]
[td]1094ms[/td]
[/tr]

[tr]
[td]Used Object Loop[/td]
[td]1404ms[/td]
[td]656ms[/td]
[td][/td]
[td]1326ms[/td]
[td]447ms[/td]
[td][/td]
[td]588ms[/td]
[td]281ms[/td]
[/tr]

[tr]
[td]Many Floats Loop[/td]
[td]1265ms[/td]
[td]328ms[/td]
[td][/td]
[td]1278ms[/td]
[td]246ms[/td]
[td][/td]
[td]420ms[/td]
[td]230ms[/td]
[/tr]

[tr]
[td]Float Array Loop[/td]
[td]?ms[/td]
[td]?ms[/td]
[td][/td]
[td]1206ms[/td]
[td]250ms[/td]
[td][/td]
[td]310ms[/td]
[td]219ms[/td]
[/tr]

Fancy calc

   private static final float fancyCalc(float ax, float ay, float az, float bx, float by, float bz, float cx, float cy, float cz)
   {
      float dotAB = ax * bx + ay * by + az * bz;
      float dotBC = bx * cx + by * cy + bz * cz;
      float dotCA = cx * ax + cy * ay + cz * az;

      return (dotAB + dotBC) * dotCA + (1.0f - dotCA);
   }

   private static final float fancyCalc(Vec3 a, Vec3 b, Vec3 c)
   {
      float dotAB = a.x * b.x + a.y * b.y + a.z * b.z;
      float dotBC = b.x * c.x + b.y * c.y + b.z * c.z;
      float dotCA = c.x * a.x + c.y * a.y + c.z * a.z;

      return (dotAB + dotBC) * dotCA + (1.0f - dotCA);
   }

   private static final float fancyCalc(float[] buf, int off)
   {
      float ax = buf[off + 0];
      float ay = buf[off + 1];
      float az = buf[off + 2];

      float bx = buf[off + 3];
      float by = buf[off + 4];
      float bz = buf[off + 5];

      float cx = buf[off + 6];
      float cy = buf[off + 7];
      float cz = buf[off + 8];

      float dotAB = ax * bx + ay * by + az * bz;
      float dotBC = bx * cx + by * cy + bz * cz;
      float dotCA = cx * ax + cy * ay + cz * az;

      return (dotAB + dotBC) * dotCA + (1.0f - dotCA);
   }

Of course the body of fancyCalc is a bit too large to measure only the overhead of the way it is invoked, but it’s more ‘real world’ this way, instead of yet another micro-benchmark.

The server VM takes even longer to warm up. Anyway, I’m giving the VM more than enough time to warm up, so which VM is used doesn’t really matter.

Wow, quick reply!

That is disappointing; I thought HotSpot would turn the ‘new object’ code into the ‘direct’ code. Well, at least object creation has gotten better in Java 6. How badly do these bottlenecks affect you? In all of my games it’s the blitting to the screen that takes most of the time.

I wonder why the 1.6 Client VM is so much quicker than the 1.5 equivalent when doing the ‘direct’ method?

PS: oops, I thought the server VM did all possible native-code compilation AND inlining ahead of time. So inlining must still be done dynamically at runtime by the server VM.

Off-topic:

I found out that FloatBuffer.get(int) is more than twice as slow in 6.0 as in 1.5 (both with the client VM):

fancyCalc FloatBuffer client 1.5: ~1000ms
fancyCalc FloatBuffer client 1.6: ~2400ms <— ?? serious regression :o >:(

fancyCalc FloatBuffer server 1.5: ~285ms
fancyCalc FloatBuffer server 1.6: ~290ms

FloatBuffer = direct, native-ordered buffer
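That is, presumably the buffer was created along these lines (a sketch; ‘floatCount’ is hypothetical), and the FloatBuffer overload of fancyCalc mirrors the float[] version above, just with absolute gets:

   import java.nio.ByteBuffer;
   import java.nio.ByteOrder;
   import java.nio.FloatBuffer;

   // direct, native-ordered FloatBuffer (4 bytes per float)
   FloatBuffer fb = ByteBuffer.allocateDirect(floatCount * 4)
                              .order(ByteOrder.nativeOrder())
                              .asFloatBuffer();

   // inside fancyCalc(FloatBuffer buf, int off):
   float ax = buf.get(off + 0);
   float ay = buf.get(off + 1);
   // ... and so on for the other seven components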

It does more aggressive inlining, but still needs to warm up. You can use -Xcomp to have methods compiled the first time they’re invoked; that would get rid of the warm-up loop.

Indeed, yet it often results in poorly optimized code, as the VM hasn’t had enough time to properly analyze the code paths and feed that data back into the optimizer.

Well, I never saw any examples showing that you could create small objects for not much more than the cost of reusing them, so I never believed it; I’m not sure why anyone did. :)

Well, I’ve been told many times here (and read elsewhere) that object pooling doesn’t give any performance boost, since the GC is so efficient and object creation is swift.

So object pooling can still be a good idea (if object creation is causing the bottleneck).

And what will you do with your vector-math code, Riven - persevere with temporary objects, or switch to primitives or object pooling?

I wrote this vector-math code for this test only. I always had a gut feeling it would be dead slow, so this was the only place that created all those objects.

The next test I’ll do will be with an ObjectPool. I have my doubts about your statement that object pooling is still worthwhile. We’ll see.

However, object-pooling time is quite consistent, and it doesn’t hurt scalability on many-CPU machines either.
I work for a larger company and my job is to tune the stuff other (cheaper, lol) programmers produce - if you’re running on a 32-64 CPU machine, generating garbage is VERY expensive and HURTS concurrency a lot. However, managing memory yourself means… well, you have to take care ;)
Have a look at Javolution, a nice framework for fast object pooling :)

Regards, Clemens

Um, are you sure you’ve got the right end of the stick here?

I thought the claim was that the garbage collection of small objects is now practically free - as cheap as shifting a pointer.

Off the top of my head, it is clearly impossible for object creation to be that cheap - you have to initialise a fair amount of data in memory (think how much data an object actually carries under the hood, even if it holds merely a single float).

Think about it in terms of allocation versus initialization. Allocation is reserving address space for the object; initialization is assigning actual values to fields, etc. So yes, allocation can indeed be as fast as a pointer bump.

I’ve done this in C++ code where I’ve written custom allocators for a routine. The routine allocates some millions of nodes over its relatively short (roughly 1 second) execution time. The allocator has a pre-allocated memory pool; when it needs to allocate a node, it simply bumps a pointer. No nodes get deallocated until the very end of the routine, at which point they are all “deallocated” by simply resetting the pointer to the top of the pool. This reduced allocation/deallocation time to just about nil.

You can do something similar in Java by creating an object pool, but those objects are still something the GC is aware of.
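A minimal sketch of that idea in Java (hypothetical class, deliberately not thread-safe): allocate all instances up front, hand them out by bumping an index, and ‘free’ them all at once by resetting it:

public class Vec3Pool
{
   private final Vec3[] pool;
   private int top;

   public Vec3Pool(int capacity)
   {
      pool = new Vec3[capacity];
      for (int i = 0; i < capacity; i++)
         pool[i] = new Vec3();
   }

   // hand out the next pre-allocated instance - just an index bump
   // (no bounds check in this sketch: the caller must size the pool correctly)
   public Vec3 get()
   {
      return pool[top++];
   }

   // 'deallocate' everything handed out so far, in one go
   public void reset()
   {
      top = 0;
   }
}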

It depends on what your pools look like. If they’re MT-safe, that definitely hurts scalability. Java heaps, by contrast, are tuned to be extremely fast for multi-threaded allocation (they blow the bog-standard C++ allocators out of the water); they can do all sorts of dirty tricks, like segmenting different areas of heap address space per thread to reduce contention. There are tricks you can do with object pools too (like creating thread-local pools, but the cost of a thread-local lookup isn’t zero), but they aren’t trivial and do involve other kinds of overhead.
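For illustration, a thread-local variant of the Vec3Pool sketched above could look like this (1.5 generics; the capacity is hypothetical) - note that every access now pays for a ThreadLocal lookup first:

   // one pool per thread, so the pool itself needs no synchronization
   private static final ThreadLocal<Vec3Pool> localPool = new ThreadLocal<Vec3Pool>()
   {
      protected Vec3Pool initialValue()
      {
         return new Vec3Pool(1024);
      }
   };

   // at the call site:
   Vec3Pool pool = localPool.get(); // the non-zero thread-local lookup
   Vec3 v = pool.get();             // then the cheap index bump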

Most people can just forget about allocation and pooling unless they’re creating millions of objects per second, or using a class that has heavyweight initialization (e.g. database connections).

Pools are for objects that are expensive to construct and/or initialise.

Cas :)