JStackAlloc - stack allocation of "value" objects in Java

Hi all,

I’ve created a library that implements support for stack allocation of “value” objects, such as vectors and matrices. It’s an evolved version of the stack allocation I used in JBullet, which manages the stack state manually. This library generates that repetitive and cumbersome code through automatic bytecode instrumentation, run as an ANT task, and provides a nice API.

It requires Java 5 and the ASM 3.x library, and is licensed under the ZLIB license.

More details are available in JavaDoc:
http://jezek2.advel.cz/tmp/jstackalloc/javadoc/

And the latest version of the library can be downloaded here:
http://jezek2.advel.cz/tmp/jstackalloc/jstackalloc-20080716.zip

(The original version posted in this thread can be found here)

Example ANT task (for NetBeans):


<target name="instrument-classes">
    <taskdef name="instrument-stack"
        classname="cz.advel.stack.instrument.InstrumentationTask"
        classpath="${run.classpath}">
    </taskdef>

    <instrument-stack dest="${build.classes.dir}" packageName="your.package.name">
        <fileset dir="${build.classes.dir}" includes="**/*.class"/>
    </instrument-stack>
</target>

<target name="-post-compile" depends="instrument-classes">
</target>

Possible future enhancements:

  • allow selectively storing the per-thread stack instance in a protected final field, initialized automatically in the constructor (this is the system used in JBullet)
  • allow passing parameters to the set method when obtaining a new instance from Stack.alloc(); this is problematic because it would need a varargs method, which would lose type safety (the heap allocations caused by varargs and autoboxing are not a problem, because the call would be entirely replaced with a direct call to the set method)
  • add an optional single-thread mode, where the stack instance is obtained from a public static field instead of a ThreadLocal

EDIT 2008/06/23: updated download to latest version and edited possible enhancements

I just discovered JBullet and it looks great, but that’s not the subject of this thread…

I must have a brain bug, but I’m having difficulty understanding the goal of this library, sorry?

The goal is to “stack allocate” temporary objects such as vectors (the implementation uses an object pool), so you don’t need to create a lot of garbage or keep temporary objects in static fields, which are not thread-safe. It also aims for a programmer-friendly API that minimizes redundant code and thus reduces mistakes.
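
To make the object-pool idea concrete, here is a minimal hand-written sketch of a per-thread pool used as a stack. The names Vec3, Vec3Stack and PoolStackDemo are made up for illustration and are not JStackAlloc’s actual classes; this is roughly the kind of bookkeeping that would otherwise have to be written manually in every method.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a value type such as Vector3f.
class Vec3 {
    float x, y, z;

    void set(float x, float y, float z) {
        this.x = x; this.y = y; this.z = z;
    }
}

// Illustrative pool-backed "stack" for one value type (an assumption, not the library's class).
class Vec3Stack {
    private final List<Vec3> pool = new ArrayList<Vec3>();
    private final int[] marks = new int[64]; // saved positions; fixed nesting depth for brevity
    private int markCount = 0;
    private int pos = 0;                     // index of the next free pooled object

    // Called on method entry: remember how much of the pool is in use.
    void push() {
        marks[markCount++] = pos;
    }

    // Called on method exit: everything handed out since push() becomes reusable.
    void pop() {
        pos = marks[--markCount];
    }

    // Hand out a pooled instance, constructing one only if the pool is exhausted.
    Vec3 get() {
        if (pos == pool.size()) {
            pool.add(new Vec3());
        }
        return pool.get(pos++);
    }

    int allocatedCount() {
        return pool.size();
    }
}

class PoolStackDemo {
    // One stack per thread, so temporaries are never shared between threads.
    static final ThreadLocal<Vec3Stack> STACK = new ThreadLocal<Vec3Stack>() {
        @Override protected Vec3Stack initialValue() {
            return new Vec3Stack();
        }
    };

    // Length of (a + b) / 2, computed with one pooled temporary.
    static float averageLength(Vec3 a, Vec3 b) {
        Vec3Stack stack = STACK.get();
        stack.push();
        try {
            Vec3 tmp = stack.get();
            tmp.set((a.x + b.x) * 0.5f, (a.y + b.y) * 0.5f, (a.z + b.z) * 0.5f);
            return (float) Math.sqrt(tmp.x * tmp.x + tmp.y * tmp.y + tmp.z * tmp.z);
        } finally {
            stack.pop(); // release the temporary even if an exception is thrown
        }
    }

    public static void main(String[] args) {
        Vec3 a = new Vec3(); a.set(3, 0, 0);
        Vec3 b = new Vec3(); b.set(3, 8, 0);
        for (int i = 0; i < 1000; i++) {
            averageLength(a, b);
        }
        System.out.println("temporaries constructed: " + STACK.get().allocatedCount());
    }
}
```

In this sketch the 1000 calls share a single constructed temporary, because pop() makes it reusable on every method exit.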

As for garbage creation, the problem is not allocating new objects or garbage collection per se; in fact HotSpot is very good at this. But when your garbage creation exceeds some threshold (as in JBullet), the GC is called very often, e.g. 10 or more times per second, and it badly affects performance, both in throughput and in frame-to-frame jerkiness, which is very visible in a game. Also, as JBullet is only a library and not a final application, it must account for the garbage produced by other libraries and by the final application.

The library trades off some performance for nearly constant behaviour (very similar to the definition of a realtime system). But because the code is generated, it can be more optimized than code that would be pretty ugly if written by hand.

The library also supports a disabled mode, where no stack allocation occurs and the method calls are instead replaced with normal heap allocation. This is good for debugging and for comparing both allocation schemes, and it also prepares for the day when HotSpot handles stack allocation (or some different optimization with similar effect) on its own.

Thank you, I understand the whole thing now. Your second explanation is far better for me :)

very good work jezek2!

But how do you determine what to stack allocate and what not? Even escape analysis is not enough for that (but it is probably the first thing you do).

(copied from doc)


 public static Vector3f average(Vector3f v1, Vector3f v2) {
     Vector3f tmp = Stack.alloc(Vector3f.class);
     tmp.add(v1, v2);
     tmp.scale(0.5f);
     return Stack.copyOut(tmp);
 }

To stack allocate this, it would be necessary to figure out whether add(), scale(), static{} and the constructor of Vector3f have “side effects” or some kind of state. In other words, how do you distinguish value objects from others?

Have you already thought about annotations?


 @PostCompile(stackallocation=true)
 public static Vector3f average(Vector3f v1, Vector3f v2) {
..
 }

Thanks :)

That’s simple: the only stack-allocated objects are those obtained by Stack.alloc(), simple as that :) So it’s up to the programmer how he maintains things. This is an entirely local modification, so a global optimization such as escape analysis is not needed.

The goal is not to create escape analysis; it’s too much work for the possible gain (in pure Java), and it is bad for libraries and runtime loading of classes (unlike a VM, which can dynamically deoptimize code if a newly loaded class conflicts with some optimization). Also, the build process would be much slower.

Ah I see,

I just realized I had understood it completely wrong :-/
I initially thought this “post processor” could be applied after compilation without any manual changes in the source code, and that it would determine by itself which methods are good candidates for stack allocation.

sorry for the confusion

If we ignore the bytecode instrumentation part, how does it compare to Javolution’s StackContext?

I would say less overhead, i.e. the stack is obtained from the ThreadLocal only once per method, and concrete methods are generated for each type, so minimal code is used for each case and no reflection is involved (though it seems that Javolution uses it only in ‘generic’ cases).
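
A rough sketch of what “one ThreadLocal lookup per method” and “concrete methods per type” could look like if written by hand. GeneratedStack, its typed getters and the lookups counter are assumptions for illustration only, not the code JStackAlloc actually emits.

```java
import java.util.ArrayList;

// Hypothetical value types standing in for e.g. Vector3f and Matrix3f.
class Vec3 { float x, y, z; }
class Mat3 { final float[] m = new float[9]; }

// Sketch of a per-thread stack with one typed getter per value class.
class GeneratedStack {
    private static final ThreadLocal<GeneratedStack> LOCAL = new ThreadLocal<GeneratedStack>() {
        @Override protected GeneratedStack initialValue() {
            return new GeneratedStack();
        }
    };

    static int lookups = 0; // demonstration counter only; not thread-safe

    private final ArrayList<Vec3> vecs = new ArrayList<Vec3>();
    private final ArrayList<Mat3> mats = new ArrayList<Mat3>();
    int vecPos = 0; // next free Vec3 in the pool
    int matPos = 0; // next free Mat3 in the pool

    static GeneratedStack get() {
        lookups++;
        return LOCAL.get();
    }

    // Typed getters use plain 'new', so no Class lookup or reflection is needed.
    Vec3 getVec3() {
        if (vecPos == vecs.size()) vecs.add(new Vec3());
        return vecs.get(vecPos++);
    }

    Mat3 getMat3() {
        if (matPos == mats.size()) mats.add(new Mat3());
        return mats.get(matPos++);
    }
}

class TypedStackDemo {
    // Trace of the outer product v * v^T, using two temporaries of different types.
    static float traceOfOuterProduct(float x, float y, float z) {
        GeneratedStack s = GeneratedStack.get(); // the only ThreadLocal access in this method
        int vecMark = s.vecPos, matMark = s.matPos;
        try {
            Vec3 v = s.getVec3();
            v.x = x; v.y = y; v.z = z;
            Mat3 outer = s.getMat3();
            // Only the diagonal is written and read; other entries may hold stale values.
            outer.m[0] = v.x * v.x;
            outer.m[4] = v.y * v.y;
            outer.m[8] = v.z * v.z;
            return outer.m[0] + outer.m[4] + outer.m[8];
        } finally {
            s.vecPos = vecMark; // release this method's temporaries
            s.matPos = matMark;
        }
    }

    public static void main(String[] args) {
        System.out.println(traceOfOuterProduct(1, 2, 3));
        System.out.println("ThreadLocal lookups: " + GeneratedStack.lookups);
    }
}
```

Saving the pool positions in local variables keeps the per-method overhead to a few field reads and writes, however many temporaries the method uses.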

New version available:
http://jezek2.advel.cz/tmp/jstackalloc/jstackalloc-20080623.zip

Also see the updated JavaDoc:
http://jezek2.advel.cz/tmp/jstackalloc/javadoc/

Changes in release 20080623:

  • Removed obsolete code and fixed closing of files
  • Added single thread mode
  • Removed copyOut method
  • Disabled stack allocation in suspendable methods when using Matthias Mann’s Continuations library

The main change is that the copyOut method was removed: it needed two copies for most usages and had error-prone behaviour, using a per-thread static instance to pass values out. Using an output parameter is now preferred; see the example in the JavaDoc.
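
As an illustration of the output-parameter style (not the JavaDoc example itself), the earlier average method could be written along these lines. Vec3 is a hypothetical stand-in for Vector3f so that the sketch compiles on its own.

```java
// Hypothetical stand-in for a value type such as Vector3f.
class Vec3 {
    float x, y, z;

    Vec3() {}

    Vec3(float x, float y, float z) {
        this.x = x; this.y = y; this.z = z;
    }

    void add(Vec3 a, Vec3 b) {
        x = a.x + b.x; y = a.y + b.y; z = a.z + b.z;
    }

    void scale(float s) {
        x *= s; y *= s; z *= s;
    }
}

class OutputParamDemo {
    // The caller owns 'out', so nothing allocated inside the method has to escape it.
    // Any intermediate temporaries would still come from Stack.alloc(...) as in the
    // thread's example; none are needed for this simple computation.
    static Vec3 average(Vec3 v1, Vec3 v2, Vec3 out) {
        out.add(v1, v2);
        out.scale(0.5f);
        return out; // returned only for convenient chaining
    }

    public static void main(String[] args) {
        Vec3 result = new Vec3(); // could itself be a temporary owned by the caller
        average(new Vec3(1, 2, 3), new Vec3(3, 4, 5), result);
        System.out.println(result.x + ", " + result.y + ", " + result.z);
    }
}
```

The result is written once, directly into storage the caller controls, which avoids both the double copy and the shared per-thread instance mentioned above.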

I don’t understand. It is already possible to select a better algorithm for the garbage collection since Java 1.4, a concurrent low pause collector by using the option -XX:+UseConcMarkSweepGC, isn’t it?

http://java.sun.com/docs/hotspot/gc1.4.2/

It is already possible to set a limit to the duration of pause since Java 1.5 by using -XX:MaxGCPauseMillis, isn’t it?

http://articles.techrepublic.com.com/5100-10878_11-6108296.html

[quote]I don’t understand. It is already possible to select a better algorithm for the garbage collection since Java 1.4, a concurrent low pause collector by using the option -XX:+UseConcMarkSweepGC, isn’t it?
[/quote]
When doing physics engine stuff you generate so much object garbage that it pretty much doesn’t matter what garbage collection algorithm is in use, you’re generating such huge piles of crap that no garbage collector could be expected to handle it all very well. These are generally objects that were never created with the intention of lasting past the current block scope, so it makes perfect sense to have your own stack to handle them since Java offers no way to let you put stuff on the “real” stack.

Plus, with such tiny objects (a vector is mainly just a holder for a few values), and especially when you might be creating hundreds of thousands of them per frame, the extra microseconds that it takes to allocate and initialize the object meta-data can really add up, so I would argue that garbage is not the entire problem, even if it is the main one.

I think this is a pretty valid argument. I’m really concerned about allocations because of what ewjordan said, but I could be wrong. Maybe the VM is already capable of it. The tool is still very valuable to me, because I can now switch easily between different allocation schemes (or create a new one, like automatically storing objects in fields instead of using the emulated stack, which would help source readability a lot).

Also, I would like to test it on the final game when it’s done, to see real results and to be prepared in case the VM doesn’t handle it well enough on its own. And I would like it to work nicely even on older HW, e.g. single-core machines bought 4+ years ago with at least 512MB RAM (and I need that memory mainly for data, not for the overhead of handling allocations).

I’ve noticed that ConcMarkSweepGC took much more memory than the default GC. I also haven’t tested how it performs when the CPU is heavily used. Allocations mean work to be done; that work is useless and its cost must show up somewhere. That’s the motivation behind this.

New version available:
http://jezek2.advel.cz/tmp/jstackalloc/jstackalloc-20080703.zip

Changes in release 20080703:

  • Fixed generation of stack class on first instrumentation run
  • Added support for storing value objects in static fields instead of stack

It can now store objects in static fields. The idea is to profile your code, look for the top hotspot methods by invocation count, and, if it makes sense, mark them with the StaticAlloc annotation (read its description carefully).
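
For comparison, this is what storing a temporary in a static field looks like when done by hand. It is only a sketch of the general technique with made-up names (Vec3, StaticTempDemo, TMP_DIFF); how the StaticAlloc annotation is actually applied, and its exact restrictions, are described in the JavaDoc and not reproduced here.

```java
// Hypothetical stand-in for a value type such as Vector3f.
class Vec3 {
    float x, y, z;

    Vec3() {}

    Vec3(float x, float y, float z) {
        this.x = x; this.y = y; this.z = z;
    }

    void sub(Vec3 a, Vec3 b) {
        x = a.x - b.x; y = a.y - b.y; z = a.z - b.z;
    }

    float length() {
        return (float) Math.sqrt(x * x + y * y + z * z);
    }
}

class StaticTempDemo {
    // One shared temporary for this method: no pool lookup at all, but unsafe if
    // the method runs on several threads at once or is re-entered recursively.
    private static final Vec3 TMP_DIFF = new Vec3();

    static float distance(Vec3 a, Vec3 b) {
        Vec3 diff = TMP_DIFF;
        diff.sub(a, b); // fully overwrites the previous contents before use
        return diff.length();
    }

    public static void main(String[] args) {
        System.out.println(distance(new Vec3(0, 0, 0), new Vec3(3, 4, 0)));
    }
}
```

The thread-safety and re-entrancy caveats are inherent to static temporaries in general, which is one reason to limit this scheme to a few profiled hotspot methods.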

About performance: I’m using this library in JBullet and it looks like it’s slightly faster than heap allocation when there is not an excessive amount of allocation per method. The better part is that it doesn’t seem to be slower than normal heap allocation, and the GC does nearly nothing :) The good thing is that I can always easily compare the different approaches with just a build option.

New version available:
http://jezek2.advel.cz/tmp/jstackalloc/jstackalloc-20080716.zip

Changes in release 20080716:

  • Fixed wrong generation of “get” method in stack when in single thread mode
  • Added isolated mode and documented InstrumentationTask

I think I’ve implemented everything correctly, but I’m getting an error:

[quote]BUILD FAILED
/path_to_file/build.xml:13: java.lang.Error: not instrumented
[/quote]
What could this mean? It’s fairly nondescript.

Edit:
What doesn’t make sense is that the line it points to is in the actual build file, in the instrumentation task.

Also, it would be great if there were a way to “recycle” objects, so they are placed back on the stack. Maybe require a set() method with no arguments?
This would eliminate a lot of object construction.

The only thing that comes to mind is that you’re using stack allocation in the static initializer of some class.

Actually, they’re recycled automatically. That’s what the bytecode instrumentation does; see the documentation for the Stack class, which has a rough example of what the instrumentation does.

Thanks, that’s probably the problem then.

Does that mean that objects allocated with Stack.alloc(class) are not guaranteed to be the result of the () constructor if they’re recycled? For example, if the instrumentation recycled a vector of <2,3,4> and you then grabbed it with Stack.alloc(class), would it still hold <2,3,4>?
I guess I misunderstood this system. So it reuses every object that you call Stack.alloc() for?

Edit:
I added

Error e = new Error("not instrumented");
e.printStackTrace();
throw e;

to the stack class so I could trace where it happened (ant doesn’t print stack traces), which worked great. For some reason I also had to clean the build every time after these errors.

Exactly, you must not rely on the previous state of value objects for this reason. The constructor is called only when there is no object present in the recycle list of the “stack”.
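
The effect can be reproduced with a tiny stand-alone pool. RecyclingPool and Vec3 are hypothetical names for this sketch, and release() is called explicitly here, whereas in JStackAlloc the recycling is done by the instrumented code.

```java
import java.util.ArrayList;

// Hypothetical stand-in for a value type such as Vector3f.
class Vec3 {
    float x, y, z;

    void set(float x, float y, float z) {
        this.x = x; this.y = y; this.z = z;
    }
}

// Minimal recycle list: the constructor runs only when nothing can be reused.
class RecyclingPool {
    private final ArrayList<Vec3> free = new ArrayList<Vec3>();
    int constructed = 0;

    Vec3 alloc() {
        if (free.isEmpty()) {
            constructed++;
            return new Vec3();
        }
        return free.remove(free.size() - 1); // recycled: fields keep their old values
    }

    void release(Vec3 v) {
        free.add(v);
    }
}

class StaleStateDemo {
    public static void main(String[] args) {
        RecyclingPool pool = new RecyclingPool();

        Vec3 first = pool.alloc();
        first.set(2, 3, 4);
        pool.release(first);

        // Same instance comes back, still holding the values written earlier.
        Vec3 second = pool.alloc();
        System.out.println("same instance: " + (second == first));
        System.out.println("stale x: " + second.x);

        // So always initialize a freshly obtained temporary before reading it.
        second.set(0, 0, 0);
        System.out.println("constructor calls: " + pool.constructed);
    }
}
```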

Excellent work, jezek! JBullet in general, and JStackAlloc in particular, are very interesting. But I have a thought and an idea: why is it so explicit at the code level? If you already use a method as powerful as bytecode instrumentation, why not use it like this:


final Vector3f v = @StackAllocation new Vector3f(...);

and translate it into the same magic in the instrumentation phase? From my current point of view, annotations give you the same possibilities as your explicit Stack.alloc, but they

  • have no influence on performance if the code is not instrumented; they just slightly increase class-file size
  • do not require instrumentation; if the code is not instrumented, it just executes as written
  • allow instrumentation at runtime, at the classloader level, and even switching between the optimized and non-optimized versions at runtime
  • are subject to more compile-time checking; for example, code like Vector3f v = @StackAlloc new Vector3f(v1) simply won’t compile if Vector3f does not have a copy constructor (unfortunately, this does not save us from a missing or wrongly implemented .set(Vector3f) method…)
  • are declarative, which to me looks much more elegant

What do you think?

Ruslan