How to do object pooling right?

Since my raytracer needs to create a lot of vector objects, I thought it might be a good idea to create a vector pool. Surprisingly, it wasn’t. Or maybe I’m just too stupid.

I tried an ArrayList as the pool container, with pool access like this (V3 is a 3D math vector type):


    private static final ArrayList<V3> pool = new ArrayList<V3>(8192);

    public static V3 make(final V3 other)
    {
        V3 v = null;
        synchronized (pool)
        {
            if (pool.size() > 0)
            {
                v = pool.remove(pool.size() - 1);
            }
        }

        if (v == null)
        {
            v = new V3(other);
        }

        return v;
    }

    public static void put(final V3 v)
    {
        synchronized (pool)
        {
            pool.add(v);
        }
    }

This turned out to be much slower than just "new()"ing all the objects and relying on the GC to clean up. Has object pooling become obsolete with modern JVMs? Am I doing it wrong? Is there a kind of pool container that works without synchronization?

Do you have numbers to show you need to think about this? Remember that we now have escape analysis & scalar replacement (at least for trivial cases… which I’d expect is what most of your vector allocations are going to look like).

Try kappa’s bag. It’ll work a lot faster than an array list.

I assume you are using threads, right? Otherwise I wouldn’t bother with synchronization. I also think that removing objects from an ArrayList is actually relatively slow, and you might be better off using a Bag for this (as order does not matter for an object pool), e.g. the one used in the Artemis framework or as described here on the forum.
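
For illustration, here is a minimal sketch of the idea behind such a bag: since order doesn’t matter, you only ever add and take from the end of a plain array, so there is none of the shifting that ArrayList.remove() can incur. The class and method names are made up for this example (it is not the actual Artemis Bag API), and V3 is the vector type from the first post.

    // Minimal sketch of an unordered "bag" used as a pool: add and remove both
    // work on the last slot, so they are O(1) and involve no element shifting.
    // Names are illustrative only; V3 is the poster's vector type.
    public class V3Bag
    {
        private V3[] data = new V3[8192];
        private int size;

        public void add(final V3 v)
        {
            if (size == data.length) {
                data = java.util.Arrays.copyOf(data, size * 2);
            }
            data[size++] = v;
        }

        public V3 removeLast()
        {
            if (size == 0) {
                return null;
            }
            final V3 v = data[--size];
            data[size] = null; // clear the slot so the bag doesn't pin the reference
            return v;
        }
    }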

The old state was 24 FPS.

Then I had a need to create a new vector object as a scaled copy of another vector in one of the innermost loops. Well, two copies of two vectors. With the new + copy operation it dropped to 18 FPS.

I figured it was the object creation (though I didn’t profile it) and implemented the pool. At first I thought I had made some sort of mistake: FPS dropped to 2. I then modified the code to have no pooling, but keep the calls to the pool methods:

    public static V3 make(final V3 other)
    {
        V3 v = null;
        /*
        synchronized (pool)
        {
            if (pool.size() > 0)
            {
                v = pool.remove(pool.size() - 1);
            }
        }
        */
        // System.err.println("Vectors in pool: " + pool.size());

        if (v == null)
        {
            v = new V3(other);
        }
        else
        {
            v.set(other);
        }

        return v;
    }

    public static void put(final V3 v)
    {
        /*
        synchronized (pool)
        {
            pool.add(v);
        }
        */
    }

With that I was back at 18 FPS.

So my findings:

  • 2x new is slow, and makes my FPS drop from 24 to 18 in the scene used
  • pooling is slow, due to the synchronized blocks.

I’m looking for a faster pool concept to get rid of the cost of the 2x new() of the vectors in the inner loop, and to get back to about 24 FPS again (minus the actual cost of scaling the vectors).

@Grunnt: Yes, the raytracer uses threads; currently 8 threads are using the vector pool in parallel. I could give each thread its own pool, maybe that works better, but it complicates the code somewhat. Removing elements from an ArrayList is slow unless it’s the last element, at least I think so. That’s why my pool takes elements from the back of the list.

If you’re going to pool, you want one pool allocator per thread… not shared. Two reasons: better memory locality within each pool, and no concurrency issues (and no concurrency ops). Oh yeah… and small objects suck, so the fastest solution would be to have a flat array per thread and just work in-place.

EDIT: And there is the middle-ground of fake structs a la Riven’s lib
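
For what it’s worth, a rough sketch of the “flat array per thread, work in-place” idea could look like the following; every name here is made up for illustration, and each worker thread would own one instance so no synchronization is needed.

    // Sketch of per-thread, in-place vector storage: one double[] per worker
    // thread, with "vectors" being nothing more than indices into it, so there
    // are no V3 objects to allocate or pool at all. Names are illustrative.
    public class VectorScratch
    {
        private final double[] data;

        public VectorScratch(final int vectorCount)
        {
            data = new double[vectorCount * 3];
        }

        // Writes s * vector j into slot i (i and j are vector indices).
        public void scaleInto(final int i, final int j, final double s)
        {
            data[i * 3]     = data[j * 3]     * s;
            data[i * 3 + 1] = data[j * 3 + 1] * s;
            data[i * 3 + 2] = data[j * 3 + 2] * s;
        }
    }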

Ditch the synchronization and use ThreadLocal!
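
Something along these lines, perhaps; a minimal sketch that assumes the make()/put() API, the V3 copy constructor and the set() method from the first post:

    import java.util.ArrayDeque;

    // Sketch of a ThreadLocal pool: each thread gets its own ArrayDeque, so no
    // synchronization is needed. V3's copy constructor and set() are assumed
    // from the code in the first post.
    public final class V3Pool
    {
        private static final ThreadLocal<ArrayDeque<V3>> POOL =
            new ThreadLocal<ArrayDeque<V3>>() {
                @Override
                protected ArrayDeque<V3> initialValue() {
                    return new ArrayDeque<V3>(8192);
                }
            };

        public static V3 make(final V3 other)
        {
            final V3 v = POOL.get().pollLast();
            if (v == null) {
                return new V3(other);
            }
            v.set(other);
            return v;
        }

        public static void put(final V3 v)
        {
            POOL.get().addLast(v);
        }
    }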

I think the main reason for using pooling is to reduce the load on the GC. This is especially important for games since huge pauses all the time are unacceptable. The performance hit is an acceptable cost for that.

Yes to no sync… but I don’t see a reason to use ThreadLocal… accessing it isn’t free… just keep the pool manager in the worker thread instance.

Here’s my memory pool:


    import java.util.Arrays;

    // Factory<T> supplies create(); Memory<T> declares alloc() and free().
    public class MemoryPool<T> implements Memory<T>
    {
        private Factory<T> factory;
        private T[] pool;
        private int size;
        private int maximum;
        private int increase;

        public MemoryPool( T[] pool, Factory<T> factory, int maximum, int increase, boolean fill )
        {
            this.pool = pool;
            this.factory = factory;
            this.maximum = maximum;
            this.increase = increase;

            // Optionally pre-fill the pool with fresh instances.
            if ( fill ) {
                while ( size < pool.length ) {
                    pool[size++] = factory.create();
                }
            }
        }

        @Override
        public T alloc()
        {
            // Take from the top of the stack, or create a new instance if the pool is empty.
            return (size == 0 ? factory.create() : pool[--size]);
        }

        @Override
        public void free( T item )
        {
            if (size < pool.length) {
                pool[size++] = item;
            } else if (size < maximum) {
                // Grow the backing array (by 'increase') up to the configured maximum.
                pool = Arrays.copyOf( pool, size + increase );
                pool[size++] = item;
            }
        }
    }

If I have different threads, I declare a statically accessible pool per thread so I don’t share between threads or need to synchronize. ThreadLocal can be fine (its performance has been improved over the years), but it won’t be better than just placing/passing an instance of the Pool around appropriately.
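
As a sketch of that last point, the worker thread can simply own its pool and hand it to whatever needs it. This assumes Factory<T> is an interface with a single create() method (as used by the pool code above) and that V3 has a no-arg constructor; the class and method names are made up for illustration.

    // Sketch: the pool lives in the worker thread itself (no ThreadLocal, no
    // sharing), and is passed explicitly to the code that needs it.
    public class RenderWorker extends Thread
    {
        private final MemoryPool<V3> vectors = new MemoryPool<V3>(
            new V3[8192],
            new Factory<V3>() {
                @Override
                public V3 create() { return new V3(); } // assumes a no-arg V3 constructor
            },
            65536, // maximum
            8192,  // increase
            true   // pre-fill
        );

        @Override
        public void run()
        {
            // The tracing code receives 'vectors' as a parameter instead of
            // going through a shared static or a ThreadLocal lookup, e.g.:
            // trace(vectors);
        }
    }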

General rule of thumb - don’t use pools unless construction is expensive (that is, you do work in the constructor), or the class is “large” (say, over 1kb per instance), or you are running on Android. Otherwise you’ll find you are more or less duplicating the work of the JVM.

Cas :slight_smile:

I would have to agree with this in most cases; however, there are cases where allocating enough objects within a short time frame will trigger GC unnecessarily. JBullet (the Java port of Bullet) was experiencing this, so they created their own pooling mechanism (using bytecode injection, so they could always guarantee an object would be put back in the pool without the programmer needing to do it explicitly). They pool vectors, matrices, quaternions, etc. It all depends on the size of the objects in memory and how many you need each frame (in essence).
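
To illustrate the underlying idea only (this is not JBullet’s actual API, which relies on bytecode instrumentation): allocations come from a per-frame stack and are all released at once by resetting a watermark, so nothing has to be returned individually. V3 is again assumed to have a no-arg constructor.

    import java.util.ArrayList;

    // Hand-rolled sketch of a stack-style pool: alloc() reuses instances below
    // the watermark, mark()/release() free everything allocated in a scope.
    // Names are made up; this is not JBullet's mechanism, just the concept.
    public class V3Stack
    {
        private final ArrayList<V3> storage = new ArrayList<V3>();
        private int top;

        public V3 alloc()
        {
            if (top == storage.size()) {
                storage.add(new V3()); // assumes a no-arg V3 constructor
            }
            return storage.get(top++);
        }

        public int mark()             { return top; }
        public void release(int mark) { top = mark; }
    }

Typical use would be to call mark() at the top of an inner loop body, alloc() freely inside it, and release() at the end, so every temporary goes back in one shot.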

Yes, you’re right; if you are creating over 100kb* of instances per frame, use a pool.

Cas :slight_smile:

  • (some arbitrary figure but you get the idea - profile first!)

My tests support this. With a ThreadLocal pool I was at 13 FPS (worse than with a direct new() and pure gc) and ThreadLocal.get() was a fairly expensive call in this scenario.

'tis happening “pixels x scene objects x light sources” times, which can be a few million per frame. But it still seems new() and the GC are the fastest of the tested scenarios. Keeping the pool as part of the WorkerThread and handing a reference to all methods which need to call V3.make() seems unwieldy though, so I think I’ll stick with a plain new() instead. Vectors are quite cheap to create.

Some day I assume I’ll make V3 instances references into an array of doubles. That might be the best solution overall.

As I mentioned earlier… there’s a good chance the ‘new’ and the initialization aren’t actually occurring in a fair number of cases, due to escape analysis and scalar replacement.

I don’t know what these terms mean. Can you explain briefly, or give me pointers to explanations? Thanks :slight_smile:

I would still suggest using the array-as-a-stack approach I recommended… since newing in Java is a relatively cheap operation, you need to ensure your alloc and free logic is as efficient as possible. Even if your pool takes a little more raw time to alloc compared to newing, remember that you save time by avoiding the GC (well, the frequency of GC can decrease drastically).

The very latest Java 7 Server VM can detect objects which only live in the scope of, say, a for() loop (“escape analysis”), and they are never allocated on the heap at all, but are instead flattened onto the stack, just like a C++ programmer might choose to do manually (“scalar replacement”). It makes for some great speedups with GC-bound algorithms.

Escape analysis also removes synchronizeds where it can, too. It’s pretty nifty.

BUT: As I say… latest Java 7 Server VM only. Oh, and Excelsior JET.

Cas :slight_smile:
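
A classic example of that synchronization removal (just an illustrative snippet, not code from the raytracer): the StringBuffer below never escapes the method, so the VM can elide its internal locking entirely.

    // The StringBuffer is synchronized internally, but since it never escapes
    // this method, escape analysis lets the JIT drop the locking (and possibly
    // the allocation as well).
    public static String label(final int frame)
    {
        final StringBuffer sb = new StringBuffer();
        sb.append("frame ").append(frame);
        return sb.toString();
    }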

More info:

First off, to see if either of these is helping you, try running with them disabled: -XX:-DoEscapeAnalysis. Also, you might want to periodically run from the command line instead of directly through whatever IDE, because you want to see what the default (or your chosen) options are giving you instead of what you might be inheriting from the IDE.

Very inexact descriptions: escape analysis attempts to determine whether it’s impossible for an object to outlive the stack frame in which it’s created. If an object doesn’t escape, then it cannot outlive the creating frame and further transforms are possible. The most common is to not use the standard heap for the allocation and instead use a very cheap ‘stack allocator’ (potentially the call stack). The allocation, along with initialization, is the creation overhead. Furthermore, HotSpot is an aggressive inliner, so for many simple classes these instances will have their initialization inlined, dropping the overhead to close to what one would have if all the fields of the object were simply local variables.

Scalar replacement is another potential transform that can be applied to an instance that doesn’t escape. That’s where it’s determined that the object itself can disappear and be replaced by the equivalent of just its (used) fields on the stack.

EDIT: It’s too bad that nobody distributes a precompiled debug version of HotSpot, as it has useful diagnostics, like dumping info about what has been transformed (such as -XX:+PrintEliminateAllocations in this case).
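
As a concrete (hypothetical) example of the kind of code this applies to, assuming V3 has public x/y/z fields and a scale(double) method: the temporary below never leaves the method, so the JIT may scalar-replace it instead of allocating it on the heap. Comparing runs with -XX:+DoEscapeAnalysis and -XX:-DoEscapeAnalysis shows whether it matters in practice.

    // The temporary V3 never escapes scaledLength(), so it is a candidate for
    // scalar replacement: its fields can live in registers/locals and the heap
    // allocation disappears. The x/y/z fields and scale() are assumed here.
    public static double scaledLength(final V3 other, final double s)
    {
        final V3 tmp = new V3(other);
        tmp.scale(s);
        return Math.sqrt(tmp.x * tmp.x + tmp.y * tmp.y + tmp.z * tmp.z);
    }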

It’s in 1.6 but required the -XX:+DoEscapeAnalysis option, whereas I believe it’s on by default in 1.7. And yes, server VMs only, and not Dalvik at all.

I’m not sure it actually did much in 1.6 but I stand to be corrected.

Cas :slight_smile: