Another way to allocate small object

A simple test on how to allocate/free small object without any GC call.

Vector2DGC.java


public class Vector2DGC
{
	private Vector2DGC_cell free[];
	private Vector2DGC_cell used;
	private int freeOfs;
	
	public Vector2DGC(int size)
	{
		this.free=new Vector2DGC_cell[size];
		this.used=null;	
		this.freeOfs=size-1;
		for(int x=0;x<size;x++)	
			this.free[x]=new Vector2DGC_cell();

	}
	
	public Vector2DGC_cell newVector2D()
	{
		
		//remove from free list
		Vector2DGC_cell vc=this.free[this.freeOfs--];
		
		
		
		//add to used list
		vc.next=this.used;
		vc.prev=null;
		if(this.used!=null)
			this.used.prev=vc;
		this.used=vc;
		
		//initialise object here if needed
		vc.v.x=0.0;
		vc.v.y=0.0;
		
				
		return vc;
	}
	
	public void freeVector2D(Vector2DGC_cell vc)
	{
		//remove from used list
		Vector2DGC_cell prev=vc.prev;
		Vector2DGC_cell next=vc.next;
		if(prev!=null)
			prev.next=vc.next;
		if(next!=null)
			next.prev=vc.prev;
			
		//add to free list	
		this.free[++freeOfs]=vc;	
	}

			
	public class Vector2DGC_cell
	{
		
		public Vector2D v;
		public Vector2DGC_cell prev;
		public Vector2DGC_cell next;
		
		public Vector2DGC_cell()
		{
			this.v=new Vector2D();
			this.prev=null;
			this.next=null;		
		}
			
			
		
		
	}	
	
}

Stack.java


/**
 * @(#)Stack.java
 *
 * Sample Applet application
 *
 * @author Bruno Augier
 * @version 1.00 08/03/10
 */
 
import java.awt.*;
import java.applet.*;

public class Stack extends Applet 
{
	Vector2DGC vgc;
	public void init() 
	{
		long sTime;
		long eTime;
		int max=500000;
		this.vgc=new Vector2DGC(max);
		
		for(int x=0;x<1000000;x++)
		{
			Thread.yield();
		}
		
		
		//compute ops on Vector2D using standard way
		sTime=System.currentTimeMillis();
		Vector2D v1[]=new Vector2D[max];
		for(int x=0;x<max;x++)
		{
			v1[x]=new Vector2D();
			
		}
		eTime=System.currentTimeMillis();
		long standardTime=eTime-sTime;
		
		
		//compute ops on Vector2D using stack way
		sTime=System.currentTimeMillis();
		Vector2D v2[]=new Vector2D[max];
		Vector2DGC.Vector2DGC_cell vc[]=new Vector2DGC.Vector2DGC_cell[max];
		for(int x=0;x<max;x++)
		{
			vc[x]=this.vgc.newVector2D();
			v2[x]=vc[x].v;
			
		}
		eTime=System.currentTimeMillis();
		long stackTime=eTime-sTime;	
		
		//compute free ops on Vector2D using stack way
		sTime=System.currentTimeMillis();
		for(int x=0;x<max;x++)
		{
			vgc.freeVector2D(vc[x]);					
		}
		eTime=System.currentTimeMillis();
		long freeStackTime=eTime-sTime;	
		
		
		System.out.println("Standard alloc time = " + standardTime);
		System.out.println("Stack alloc time = " + stackTime);
		System.out.println("Stack free time = " + freeStackTime);
	}

	public void paint(Graphics g) 
	{
		
	}
}

Vector2D.java

public class Vector2D
{
	public double x;
	public double y;
	
	public void someMethod()
	{
		
		
		
	}
	
	
}

results on my AMD 1.7Ghz with Java 1.4.2-02:
Standard alloc time = 609
Stack alloc time = 94
Stack free time = 47

results on my AMD 1.7Ghz with Java 1.6.2_02::
Standard alloc time = 203
Stack alloc time = 94
Stack free time = 31

seems that it dont help a lot for the 1.6 JVM , but the Stack way will avoid GC call in the middle of your favorite game so it may still help

This is a technique known as object pooling. There are third-party libraries available for this as well.

[quote]This is a technique known as object pooling. There are third-party libraries available for this as well.
[/quote]
heu… yup I know…

just want to give a simple explanation on how it can be implemented and why it is not the hell to make it.

What is the motive for maintaining that linked list of used objects? Why not just use the array of free objects? Also, you’re not setting free[freeOfs] to null when you remove an object from it.

What is the motive for maintaining that linked list of used objects? 

to be true I cant remenber :), I wrote it in a row of 25 min… so I may have a good reason when I started it :slight_smile: but I forget the main reaon of this object linkage

free[freeOfs] to null when you remove 

ofcourse this is to avoid GC call! and ensure that a new alloc request will work!

EDIT:
sry, I misunderstood point 2, setting to null is unecessary , no ?

As long as the caller is using the object, it will not be collected even if the element in the array is set to null. The object is returned to the array when it is set free. So at all times, either the caller or the pool will have a reference to the object.

In this case setting the array elements to null might not matter (even if the order of allocating and freeing the objects differs), but in general such things may cause memory leak bugs.

Your pool will throw an exception if the pool becomes empty or full. Is this by design? At least it will be easy to detect if not all objects are freed - the game just crashes.

Here is my version of a pool. It will handle an empty/full pool gracefully. My tests show it to be about 30% faster than your code, probably because of not having such a linked list.


public class Vector2DPool {

    private final Vector2D[] pool;
    private int current;

    public Vector2DPool(int size) {
        pool = new Vector2D[size];
        for (int i = 0; i < pool.length; i++) {
            pool[i] = new Vector2D();
        }
        current = pool.length - 1;
    }

    public Vector2D get() {
        Vector2D v;
        if (current >= 0) {
            v = pool[current];
            pool[current] = null;
            current--;
        } else {
            v = new Vector2D();
        }
        v.x = 0.0;
        v.y = 0.0;
        return v;
    }

    public void free(Vector2D v) {
        if (current + 1 < pool.length) {
            current++;
            pool[current] = v;
        }
    }
}

nice, this version is more simple and more efficient

Here is the benchmark which I used. The output on jdk1.6.0_03 (-server), WinXP, Core 2 Q6600, 3.5GB RAM:


Benchmark Results
1: Standard alloc                128.63 ns    (min 106.22 ns, max 145.32 ns)
2: Vector2DGC: alloc + free        9.78 ns    (min   9.33 ns, max  10.25 ns)
3: Vector2DPool: alloc + free      6.53 ns    (min   6.29 ns, max   6.54 ns)
4: (nothing)                       3.50 ns    (min   3.50 ns, max   3.50 ns)

The Benchmark class is a utility which I made some time ago. Maybe you’ll find it useful. 8) Suggestions for improvement are also welcome, especially if you find out that it produces inaccurate results.


import net.orfjackal.tools.Benchmark;

/**
 * @author Esko Luontola
 * @since 10.3.2008
 */
public class Vector2DBench {

    private static int x = 0;
    private static final int SIZE = 100000;

    public static void main(String[] args) throws InterruptedException {

        final Vector2DGC vgc = new Vector2DGC(SIZE);
        final Vector2DPool pool = new Vector2DPool(SIZE);

        Benchmark b = new Benchmark();
        b.runBenchmark("Standard alloc", new Runnable() {
            public void run() {
                Vector2D vector2D = new Vector2D();
                x += vector2D.hashCode();
            }
        });
        b.runBenchmark("Vector2DGC: alloc + free", new Runnable() {
            public void run() {
                Vector2DGC.Vector2DGC_cell cell = vgc.newVector2D();
                x += cell.v.hashCode();
                vgc.freeVector2D(cell);
            }
        });
        b.runBenchmark("Vector2DPool: alloc + free", new Runnable() {
            public void run() {
                Vector2D vector2D = pool.get();
                x += vector2D.hashCode();
                pool.free(vector2D);
            }
        });
        b.runBenchmark("(nothing)", new Runnable() {
            Vector2D tmp = new Vector2D();

            public void run() {
                x += tmp.hashCode();
            }
        });
        b.printResults();
    }
}


/*
 * Copyright (c) 2008, Esko Luontola. All Rights Reserved.
 */

package net.orfjackal.tools;

import java.text.DecimalFormat;
import java.text.DecimalFormatSymbols;
import java.text.NumberFormat;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

/**
 * @author Esko Luontola
 * @since 30.1.2008
 */
public class Benchmark {

    private static final Runnable NULL_BENCHMARK = new Runnable() {
        public void run() {
        }
    };

    private final int measureRounds;
    private final int minimumDurationMs;
    private final List<Result> results = new ArrayList<Result>();

    public Benchmark() {
        this(5, 500);
    }

    public Benchmark(int measureRounds, int minimumDurationMs) {
        this.measureRounds = measureRounds;
        this.minimumDurationMs = minimumDurationMs;
    }

    public Result runBenchmark(String description, Runnable benchmark) {
        int repeats = autoRepeatCount(benchmark, minimumDurationMs);
        long[] durations = finalMeasurements(benchmark, repeats, measureRounds);
        long[] measurementNoise = finalMeasurements(NULL_BENCHMARK, repeats, measureRounds);
        Result result = new Result(description, repeats, durations, measurementNoise);
        results.add(result);
        return result;
    }

    private static int autoRepeatCount(Runnable benchmark, int minimumDurationMs) {
        int repeats = 1;
        long durationMs = 0;
        while (durationMs < minimumDurationMs) {
            long start = System.currentTimeMillis();
            repeat(repeats, benchmark);
            long end = System.currentTimeMillis();

            durationMs = end - start;
            if (durationMs < minimumDurationMs) {
                repeats = repeats * 2;
            }
        }
        return repeats;
    }

    private static long[] finalMeasurements(Runnable benchmark, int repeats, int measureRounds) {
        long[] durations = new long[measureRounds];
        for (int i = 0; i < durations.length; i++) {
            long start = System.currentTimeMillis();
            repeat(repeats, benchmark);
            long end = System.currentTimeMillis();
            durations[i] = end - start;
        }
        return durations;
    }

    private static void repeat(int repeats, Runnable benchmark) {
        for (int i = 0; i < repeats; i++) {
            benchmark.run();
        }
    }

    public void printResults() {
        DecimalFormat nf = getNumberFormat();
        int indexMaxLength = (results.size() + 1) / 10 + 1;
        int descMaxLength = 0;
        int nanosMaxLength = 0;
        for (Result result : results) {
            descMaxLength = Math.max(descMaxLength, result.getDescription().length());
            nanosMaxLength = Math.max(nanosMaxLength, nf.format(result.getMaxNanos()).length());
        }
        System.out.println("Benchmark Results");
        for (int i = 0; i < results.size(); i++) {
            Result result = results.get(i);
            String index = pad(i + 1, -indexMaxLength, " ");
            String desc = pad(result.getDescription(), descMaxLength, " ");
            String nanos = pad(nf.format(result.getNanos()), -nanosMaxLength, " ");
            String minNanos = pad(nf.format(result.getMinNanos()), -nanosMaxLength, " ");
            String maxNanos = pad(nf.format(result.getMaxNanos()), -nanosMaxLength, " ");
            System.out.println(index + ": " + desc + "    " + nanos + " ns"
                    + "    (min " + minNanos + " ns, max " + maxNanos + " ns)");
        }
    }

    private static DecimalFormat getNumberFormat() {
        return new DecimalFormat("0.00", DecimalFormatSymbols.getInstance(Locale.ENGLISH));
    }

    private static String pad(Object str, int padlen, String pad) {
        String padding = "";
        int len = Math.abs(padlen) - str.toString().length();
        if (len < 1) {
            return str.toString();
        }
        for (int i = 0; i < len; ++i) {
            padding = padding + pad;
        }
        return (padlen < 0 ? padding + str : str + padding);
    }

    public static class Result {

        private static final int MILLIS_TO_NANOS = 1000 * 1000;

        private final String description;
        private final int repeats;
        private final long[] totalDurationsMs;

        public Result(String description, int repeats, long[] totalDurationsMs, long[] measurementNoiseMs) {
            totalDurationsMs = sortedCopy(totalDurationsMs);
            measurementNoiseMs = sortedCopy(measurementNoiseMs);
            subtractNoise(totalDurationsMs, min(measurementNoiseMs));
            this.description = description;
            this.repeats = repeats;
            this.totalDurationsMs = totalDurationsMs;
        }

        private static long[] sortedCopy(long[] values) {
            values = values.clone();
            Arrays.sort(values);
            return values;
        }

        private static void subtractNoise(long[] totalMs, long noiseMs) {
            for (int i = 0; i < totalMs.length; i++) {
                totalMs[i] -= noiseMs;
            }
        }

        public String getDescription() {
            return description;
        }

        public int getRepeats() {
            return repeats;
        }

        public long getTotalMillis() {
            return median(totalDurationsMs);
        }

        public double getNanos() {
            return actualNanos(median(totalDurationsMs));
        }

        public double getMinNanos() {
            return actualNanos(min(totalDurationsMs));
        }

        public double getMaxNanos() {
            return actualNanos(max(totalDurationsMs));
        }

        private double actualNanos(long millis) {
            return millis * ((double) MILLIS_TO_NANOS / (double) repeats);
        }

        private static long median(long[] longs) {
            return longs[longs.length / 2];
        }

        private static long min(long[] longs) {
            return longs[0];
        }

        private static long max(long[] longs) {
            return longs[longs.length - 1];
        }

        public String toString() {
            NumberFormat nf = getNumberFormat();
            return "Benchmark.Result[" + getDescription() + ": " + nf.format(getNanos()) + " ns ("
                    + getRepeats() + " repeats, " + getTotalMillis() + " ms)]";
        }
    }
}

EDIT: Improved benchmark by adding measurement noise calculation.

On a quick read of that code, the one problem I see is that it doesn’t look like anything is done with x (which I presume is in there to make sure the JIT doesn’t eliminate those loops altogether). Are the current JITs smart enough to see that you’re doing nothing with it? Honestly, for all the claims that stuff like this happens in today’s JVMs I’ve rarely actually noticed it, even in microbenchmarks, but it shouldn’t hurt to add a System.out.println(x) at the end there just to cover your bases!

Actually, if anything I’m surprised by the level of gains achieved by pooling in this case. I thought things were much closer than that these days. It’s unfortunate that there’s no way to do it transparently (I have pages and pages of old code that I’d love to test it out on, but I lack the energy to convert the code by hand and I’m too lazy to write and debug a script to do so) - even C++ handles deallocation of stack variables, it’s kind of a pain to have to handle it ourselves, but if it’s necessary, it’s necessary!

At some point I’ll have to check whether this is faster or slower than inlining float temporaries in some realistic vector manipulation code - I was surprised to find in some previous tests that sometimes it’s actually faster to store your results in objects than in arrays of floats, so I might be surprised here, too.

I think the benchmark had a problem - maybe hashCode() is slow when on invoked the first time on an object. I updated the benchmark and now it looks much better. Doing unbiased benchmarks is hard… :-\

The difference between standard allocation on server and client VM is strange. ???

jdk1.6.0_03 -server


Benchmark Results
1: Standard alloc                8.39 ns    (min 8.15 ns, max 8.39 ns)
2: Vector2DGC: alloc + free      7.45 ns    (min 7.00 ns, max 7.45 ns)
3: Vector2DPool: alloc + free    3.26 ns    (min 3.25 ns, max 3.26 ns)
4: (nothing)                     0.00 ns    (min 0.00 ns, max 0.24 ns)

jdk1.6.0_03 -client


Benchmark Results
1: Standard alloc                 4.66 ns    (min  4.65 ns, max  4.90 ns)
2: Vector2DGC: alloc + free      19.58 ns    (min 19.55 ns, max 20.06 ns)
3: Vector2DPool: alloc + free     8.85 ns    (min  8.40 ns, max  8.88 ns)
4: (nothing)                      0.11 ns    (min  0.00 ns, max  0.11 ns)

When the -verbose:gc VM parameter is used, lots of GC happens during the first test, but no full scans. Most of them take only a couple of milliseconds. Also, does the VM run those GCs in another thread (I have a quad core CPU)?


[GC 3072K->3173K(32768K), 0.0175088 secs]
[GC 6245K->6623K(35840K), 0.0226372 secs]
[GC 12767K->8823K(35840K), 0.0158842 secs]
[GC 14967K->8839K(41984K), 0.0041840 secs]
[GC 21127K->8847K(41984K), 0.0033687 secs]
[GC 21135K->8855K(58752K), 0.0036974 secs]
[GC 33431K->8863K(58752K), 0.0054548 secs]
[GC 33439K->8839K(82176K), 0.0004737 secs]
[GC 57095K->8839K(82304K), 0.0004051 secs]
[GC 57095K->8839K(82240K), 0.0005329 secs]
[GC 57415K->8839K(82432K), 0.0004598 secs]
[GC 57415K->8839K(82496K), 0.0004377 secs]
[GC 57927K->8839K(82624K), 0.0004223 secs]
[GC 57927K->8839K(82752K), 0.0004253 secs]
...


import net.orfjackal.tools.Benchmark;

/**
 * @author Esko Luontola
 * @since 10.3.2008
 */
public class Vector2DBench {

    private static final int SIZE = 100000;

    public static void main(String[] args) throws InterruptedException {

        final Vector2DGC vgc = new Vector2DGC(SIZE);
        final Vector2DPool pool = new Vector2DPool(SIZE);

        Benchmark b = new Benchmark();
        b.runBenchmark("Standard alloc", new Runnable() {
            public void run() {
                Vector2D vector2D = new Vector2D();
            }
        });
        b.runBenchmark("Vector2DGC: alloc + free", new Runnable() {
            public void run() {
                Vector2DGC.Vector2DGC_cell cell = vgc.newVector2D();
                Vector2D vector2D = cell.v;
                vgc.freeVector2D(cell);
            }
        });
        b.runBenchmark("Vector2DPool: alloc + free", new Runnable() {
            public void run() {
                Vector2D vector2D = pool.get();
                pool.free(vector2D);
            }
        });
        b.runBenchmark("(nothing)", new Runnable() {
            public void run() {
            }
        });
        b.printResults();
    }
}

I think you bunch of test code are wrong:

you must use outside of your loop object you allocate, the way you done it they are only know in the main loop and instruction may be removed by compiler, also using Thread without higest priority can give wrong result, last you must do some kind of JVM warmup to ensure that all code get compiled before running. Even if is is simple, I think the first bench is the one who give the smartest results :wink:

EDIT:

[quote]At some point I’ll have to check whether this is faster or slower than inlining float temporaries in some realistic vector manipulation code - I was surprised to find in some previous tests that sometimes it’s actually faster to store your results in objects than in arrays of floats, so I might be surprised here, too
[/quote]
basically random array acces are slow in java while sequential access are “near” as fast as local var storage. another point is that accessing an object attrbute requiere approximativly the same assembler code than accessing an array (without the bounds test in Java) so if compiler is good and i am sure it is, that’s not really surprising.

NB.: even with all possible Java/JVM optimisation allocating an object is and will always be slower than reusing an existing object…

While this is true, think about the following code:


//member vars
public Vec2 position;
public Vec2 velocity;
public Vec2 force;
public float mass;

public void timeStep(float dt){
  position.addLocal( velocity.mul(dt) ); //creates one temp Vec2 due to mul call
  Vec2 acceleration = force.mul( dt / mass ); //another temp Vec2
  velocity.addLocal( acceleration );
}

What would be ideal is that the compiler recognize that none of those temp Vec2s are used for anything except shuffling floats around, and thus the timeStep body could be replaced by the following:


position.x += velocity.x * dt;
position.y += velocity.y * dt;
velocity.x += force.x * dt / mass;
velocity.y += force.y * dt / mass;

This should be faster than even reusing an existing Vec2, and it doesn’t seem like it’s such an outrageous expectation of a decent optimizing compiler (though I happily admit zero knowledge about JIT compiler design), at least assuming some sort of escape analysis is in place.

[quote=“DzzD,post:11,topic:31495”]
The Benchmark class runs each run() method as a warmup in for at least 750ms (which for this short runs means some tens of millions of times), so that it will know how many times to run the code in order to get an accurate measurement (because System.currentTimeMillis() is very inaccurate, maybe only 10ms, the measured time must be long enough). Let’s say that during warmup it will find out that running the run() method 100M times will last over 500ms. Then it will run the code 5 times 100M times and will get 5 measurement results which are converted to nanoseconds (how long one run() takes - the median, min and max times are shown). The time needed for running an empty run() method is subtracted from the measurement results, so as to not calculate the overhead of the measurement loop. Read the code for more details.

The problem is as always, how to make sure that the test code represents an actual use case, and that the JIT compiler does not optimize the test code away. Feel free to prove that Vector2DBench is written incorrectly and write a better version, but in any case the warmup is already taken care of automatically.

[quote]System.currentTimeMillis() is very inaccurate
[/quote]
on windows it is approximativly 16.5ms ticks

as you said innacuracy of currentTimeMillis is not a problem for a test that run for more than 100 ms as innacuracy become <1.6ms and reduce if test is longer.

[quote]results on my AMD 1.7Ghz with Java 1.4.2-02:
Standard alloc time = 609
Stack alloc time = 94
Stack free time = 47
[/quote]
so, this test accuracy is:
Standard alloc time = 609 +/- 2% ===> 598 to 630
Stack alloc time = 94 +/- 17% ==> 80 to 109
Stack free time = 47 +/- 35% ==> 30 to 63

results are accurate enought to show the order of difference between tests.

[quote]The problem is as always, how to make sure that the test code represents an actual use case
[/quote]
in most of case it is necessary to use data manipulated in the test loop ouside of it

ex.:

int y=0;
for(int x=0;x<10000000;x++)
 y+=x;
//print time here

will give a very different result than:

int y=0;
for(int x=0;x<10000000;x++)
 y+=x;
//print time here
System.println(y);

I also guess that

for(int x=0;x<10000000;x++)
 Object v=new Object();

will only allocate one object or zero object

ex:

while(true)	
{
 Object o=new Object();
}

doe not grow take more memory than

Object o=new Object();

When I run the following code with the VM options -server -verbose:gc (Java 1.6.0_03), even after many minutes it reports GCs happening all the time. So in this case the VM does not optimize away the allocation of that Object even though nothing is done with it.


public class AllocTest {

    public void exec() {
        while (true) {
            foo();
        }
    }

    public void foo() {
        new Object();
    }

    public static void main(String[] args) {
        new AllocTest().exec();
    }
}