How to do object pooling right?

Since my raytracer needs to create a lot of vector objects, I thought it might be a good idea to create a vector pool. Surprisingly, it wasn’t. Or maybe I’m just too stupid.

I tried an ArrayList as the pool container, with pool access like this (V3 is a 3D math vector type):


    private static final ArrayList<V3> pool = new ArrayList<V3>(8192);

    public static V3 make(final V3 other)
    {
        V3 v = null;
        synchronized (pool)
        {
            if (pool.size() > 0)
            {
                v = pool.remove(pool.size() - 1);
            }
        }

        if (v == null)
        {
            v = new V3(other);
        }

        return v;
    }

    public static void put(final V3 v)
    {
        synchronized (pool)
        {
            pool.add(v);
        }
    }

This turned out to be much slower than just "new()"ing all the objects and relying on the GC to clean up. Has object pooling become obsolete with modern JVMs? Am I doing it wrong? Is there a kind of pool container that works without synchronization?

Do you have numbers to show you need to think about this? Remember that we now have escape analysis & scalar replacement (at least for trivial cases… which I’d expect is what most of your vector allocations are going to look like).

Try kappa’s bag. It’ll work a lot faster than an array list.

I assume you are using threads, right? Otherwise I wouldn’t bother with synchronization. I also think that removing objects from an ArrayList is actually relatively slow, and you might be better off using a Bag for this (as order does not matter for an object pool), e.g. the one used in the Artemis framework or as described here on the forum.
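
For illustration, here is a minimal sketch of the idea behind such a bag: since order doesn’t matter, you only ever add and take from the end of a plain array, so there is none of the shifting that ArrayList.remove() can incur. The class and method names are made up for this example (it is not the actual Artemis Bag API), and V3 is the vector type from the first post.

    // Minimal sketch of an unordered "bag" used as a pool: add and remove both
    // work on the last slot, so they are O(1) and involve no element shifting.
    // Names are illustrative only; V3 is the poster's vector type.
    public class V3Bag
    {
        private V3[] data = new V3[8192];
        private int size;

        public void add(final V3 v)
        {
            if (size == data.length) {
                data = java.util.Arrays.copyOf(data, size * 2);
            }
            data[size++] = v;
        }

        public V3 removeLast()
        {
            if (size == 0) {
                return null;
            }
            final V3 v = data[--size];
            data[size] = null; // clear the slot so the bag doesn't pin the reference
            return v;
        }
    }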

The old state was 24 FPS.

Then I had a need to create a new vector object as a scaled copy of another vector in one of the innermost loops. Well, two copies of two vectors. With the new + copy operation it dropped to 18 FPS.

I figured it was the object creation (though I didn’t profile it) and implemented the pool. At first I thought I had made some sort of mistake: FPS dropped to 2. I then modified the code to have no pooling, but keep the calls to the pool methods:

    public static V3 make(final V3 other)
    {
        V3 v = null;
        /*
        synchronized (pool)
        {
            if (pool.size() > 0)
            {
                v = pool.remove(pool.size() - 1);
            }
        }
        */
        // System.err.println("Vectors in pool: " + pool.size());

        if (v == null)
        {
            v = new V3(other);
        }
        else
        {
            v.set(other);
        }

        return v;
    }

    public static void put(final V3 v)
    {
        /*
        synchronized (pool)
        {
            pool.add(v);
        }
        */
    }

With that I was back at 18 FPS.

So my findings:

  • 2x new is slow, and makes my FPS drop from 24 to 18 in the scene used
  • pooling is slow, due to the synchronized blocks.

I’m looking for a faster pool concept to get rid of the cost of the 2x new() of the vectors in the inner loop, and to get back to about 24 FPS again (minus the actual cost of scaling the vectors).

@Grunnt: Yes, the raytracer uses threads; currently 8 threads are using the vector pool in parallel. I could give each thread its own pool, maybe that works better, but it complicates the code somewhat. Removing elements from an ArrayList is slow unless it’s the last element, at least I think so. That’s why my pool takes elements from the back of the list.

If you’re going to pool, you want one pool allocator per thread… not shared. Two reasons: better memory locality within each pool, and no concurrency issues (and no concurrency ops). Oh yeah… and small objects suck, so the fastest solution would be to have a flat array per thread and just work in-place.

EDIT: And there is the middle-ground of fake structs a la Riven’s lib
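
For what it’s worth, a rough sketch of the “flat array per thread, work in-place” idea could look like the following; every name here is made up for illustration, and each worker thread would own one instance so no synchronization is needed.

    // Sketch of per-thread, in-place vector storage: one double[] per worker
    // thread, with "vectors" being nothing more than indices into it, so there
    // are no V3 objects to allocate or pool at all. Names are illustrative.
    public class VectorScratch
    {
        private final double[] data;

        public VectorScratch(final int vectorCount)
        {
            data = new double[vectorCount * 3];
        }

        // Writes s * vector j into slot i (i and j are vector indices).
        public void scaleInto(final int i, final int j, final double s)
        {
            data[i * 3]     = data[j * 3]     * s;
            data[i * 3 + 1] = data[j * 3 + 1] * s;
            data[i * 3 + 2] = data[j * 3 + 2] * s;
        }
    }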

Ditch the synchronization and use ThreadLocal!
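
Something along these lines, perhaps; a minimal sketch that assumes the make()/put() API, the V3 copy constructor and the set() method from the first post:

    import java.util.ArrayDeque;

    // Sketch of a ThreadLocal pool: each thread gets its own ArrayDeque, so no
    // synchronization is needed. V3's copy constructor and set() are assumed
    // from the code in the first post.
    public final class V3Pool
    {
        private static final ThreadLocal<ArrayDeque<V3>> POOL =
            new ThreadLocal<ArrayDeque<V3>>() {
                @Override
                protected ArrayDeque<V3> initialValue() {
                    return new ArrayDeque<V3>(8192);
                }
            };

        public static V3 make(final V3 other)
        {
            final V3 v = POOL.get().pollLast();
            if (v == null) {
                return new V3(other);
            }
            v.set(other);
            return v;
        }

        public static void put(final V3 v)
        {
            POOL.get().addLast(v);
        }
    }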

I think the main reason for using pooling is to reduce the load on the GC. This is especially important for games since huge pauses all the time are unacceptable. The performance hit is an acceptable cost for that.

Yes to no sync… but I don’t see a reason to use ThreadLocal… accessing it isn’t free… just keep the pool manager in the worker thread instance.

Here’s my memory pool:


    import java.util.Arrays;

    // Factory<T> supplies create(); Memory<T> declares alloc() and free().
    public class MemoryPool<T> implements Memory<T>
    {
        private Factory<T> factory;
        private T[] pool;
        private int size;
        private int maximum;
        private int increase;

        public MemoryPool( T[] pool, Factory<T> factory, int maximum, int increase, boolean fill )
        {
            this.pool = pool;
            this.factory = factory;
            this.maximum = maximum;
            this.increase = increase;

            // Optionally pre-fill the pool with fresh instances.
            if ( fill ) {
                while ( size < pool.length ) {
                    pool[size++] = factory.create();
                }
            }
        }

        @Override
        public T alloc()
        {
            // Take from the top of the stack, or create a new instance if the pool is empty.
            return (size == 0 ? factory.create() : pool[--size]);
        }

        @Override
        public void free( T item )
        {
            if (size < pool.length) {
                pool[size++] = item;
            } else if (size < maximum) {
                // Grow the backing array (by 'increase') up to the configured maximum.
                pool = Arrays.copyOf( pool, size + increase );
                pool[size++] = item;
            }
        }
    }

If I have different threads, I declare a statically accessible pool per thread so I don’t share between threads or need to synchronize. ThreadLocal can be fine (its performance has been improved over the years), but it won’t be better than just placing/passing an instance of the Pool around appropriately.
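
As a sketch of that last point, the worker thread can simply own its pool and hand it to whatever needs it. This assumes Factory<T> is an interface with a single create() method (as used by the pool code above) and that V3 has a no-arg constructor; the class and method names are made up for illustration.

    // Sketch: the pool lives in the worker thread itself (no ThreadLocal, no
    // sharing), and is passed explicitly to the code that needs it.
    public class RenderWorker extends Thread
    {
        private final MemoryPool<V3> vectors = new MemoryPool<V3>(
            new V3[8192],
            new Factory<V3>() {
                @Override
                public V3 create() { return new V3(); } // assumes a no-arg V3 constructor
            },
            65536, // maximum
            8192,  // increase
            true   // pre-fill
        );

        @Override
        public void run()
        {
            // The tracing code receives 'vectors' as a parameter instead of
            // going through a shared static or a ThreadLocal lookup, e.g.:
            // trace(vectors);
        }
    }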

General rule of thumb - don’t use pools unless construction is expensive (that is, you do work in the constructor), or the class is “large” (say, over 1kb per instance), or you are running on Android. Otherwise you’ll find you are more or less duplicating the work of the JVM.

Cas :slight_smile:

I would have to agree with this in most cases; however, there are cases where allocating enough objects within a short time frame will trigger GC unnecessarily. JBullet (the Java port of Bullet) was experiencing this, so they created their own pooling mechanism (using bytecode injection, so they could always guarantee an object would be put back in the pool without the programmer needing to do it explicitly). They pool vectors, matrices, quaternions, etc. It all depends on the size of the objects in memory and how many you need each frame (in essence).
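
To illustrate the underlying idea only (this is not JBullet’s actual API, which relies on bytecode instrumentation): allocations come from a per-frame stack and are all released at once by resetting a watermark, so nothing has to be returned individually. V3 is again assumed to have a no-arg constructor.

    import java.util.ArrayList;

    // Hand-rolled sketch of a stack-style pool: alloc() reuses instances below
    // the watermark, mark()/release() free everything allocated in a scope.
    // Names are made up; this is not JBullet's mechanism, just the concept.
    public class V3Stack
    {
        private final ArrayList<V3> storage = new ArrayList<V3>();
        private int top;

        public V3 alloc()
        {
            if (top == storage.size()) {
                storage.add(new V3()); // assumes a no-arg V3 constructor
            }
            return storage.get(top++);
        }

        public int mark()             { return top; }
        public void release(int mark) { top = mark; }
    }

Typical use would be to call mark() at the top of an inner loop body, alloc() freely inside it, and release() at the end, so every temporary goes back in one shot.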

Yes, you’re right; if you are creating over 100kb* of instances per frame, use a pool.

Cas :slight_smile:

  • (some arbitrary figure but you get the idea - profile first!)

My tests support this. With a ThreadLocal pool I was at 13 FPS (worse than with a direct new() and pure gc) and ThreadLocal.get() was a fairly expensive call in this scenario.

'tis happening “pixels x scene objects x light sources” times, which can be a few million per frame. But it still seems new() and the GC are the fastest of the tested scenarios. Keeping the pool as part of the WorkerThread and handing a reference to all methods which need to call V3.make() seems unwieldy though, so I think I’ll stick with a plain new() instead. Vectors are quite cheap to create.

Some day I assume I’ll make V3 instances references into an array of doubles. That might be the best solution overall.

As I mentioned earlier… there’s a good chance the ‘new’ and the initialization aren’t actually occurring in a fair number of cases, due to escape analysis and scalar replacement.

I don’t know what these terms mean. Can you explain briefly, or give me pointers to explanations? Thanks :slight_smile:

I would still suggest using the array-as-a-stack approach I recommended… since newing in Java is a relatively cheap operation, you need to ensure your alloc and free logic is as efficient as possible. Even if your pool takes a little more raw time to alloc compared to newing, remember that you save time by avoiding the GC (well, the frequency of GC can decrease drastically).

The very latest Java 7 Server VM can detect objects which only live in the scope of, say, a for() loop (“escape analysis”), and they are never allocated on the heap at all, but are instead flattened onto the stack, just like a C++ programmer might choose to do manually (“scalar replacement”). It makes for some great speedups with GC-bound algorithms.

Escape analysis also removes synchronizeds where it can, too. It’s pretty nifty.

BUT: As I say… latest Java 7 Server VM only. Oh, and Excelsior JET.

Cas :slight_smile:
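
A classic example of that synchronization removal (just an illustrative snippet, not code from the raytracer): the StringBuffer below never escapes the method, so the VM can elide its internal locking entirely.

    // The StringBuffer is synchronized internally, but since it never escapes
    // this method, escape analysis lets the JIT drop the locking (and possibly
    // the allocation as well).
    public static String label(final int frame)
    {
        final StringBuffer sb = new StringBuffer();
        sb.append("frame ").append(frame);
        return sb.toString();
    }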

More info:

First off, to see if either of these is helping you, try running with them disabled: -XX:-DoEscapeAnalysis. Also, you might want to periodically run from the command line instead of directly through whatever IDE, because you want to see what the default (or your chosen) options are giving you instead of what you might be inheriting from the IDE.

Very inexact descriptions: escape analysis attempts to determine whether it’s impossible for an object to outlive the stack frame in which it’s created. If an object doesn’t escape, then it cannot outlive the creating frame and further transforms are possible. The most common is to not use the standard heap for the allocation and instead use a very cheap ‘stack allocator’ (potentially the call stack). The allocation, along with initialization, is the creation overhead. Furthermore, HotSpot is an aggressive inliner, so for many simple classes these instances will have their initialization inlined, dropping the overhead to close to what one would have if all the fields of the object were simply local variables.

Scalar replacement is another potential transform that can be applied to an instance that doesn’t escape. That’s where it’s determined that the object itself can disappear and be replaced by the equivalent of just its (used) fields on the stack.

EDIT: It’s too bad that nobody distributes a precompiled debug version of HotSpot, as it has useful diagnostics, like dumping info about what has been transformed (such as -XX:+PrintEliminateAllocations in this case).
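
As a concrete (hypothetical) example of the kind of code this applies to, assuming V3 has public x/y/z fields and a scale(double) method: the temporary below never leaves the method, so the JIT may scalar-replace it instead of allocating it on the heap. Comparing runs with -XX:+DoEscapeAnalysis and -XX:-DoEscapeAnalysis shows whether it matters in practice.

    // The temporary V3 never escapes scaledLength(), so it is a candidate for
    // scalar replacement: its fields can live in registers/locals and the heap
    // allocation disappears. The x/y/z fields and scale() are assumed here.
    public static double scaledLength(final V3 other, final double s)
    {
        final V3 tmp = new V3(other);
        tmp.scale(s);
        return Math.sqrt(tmp.x * tmp.x + tmp.y * tmp.y + tmp.z * tmp.z);
    }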

It’s in 1.6 but required the -XX:+DoEscapeAnalysis option, whereas I believe it’s on by default in 1.7. And yes, server VMs only, and not Dalvik at all.

I’m not sure it actually did much in 1.6 but I stand to be corrected.

Cas :slight_smile: