Programming to the cache

I’ve been looking around for a while on this, and haven’t found much info online, so I figured I’d ask here - game programmers tend to be more psychotic than most about speed!

In compiled languages, one of the most important optimizations you can make is to write your code in a cache-friendly manner, especially with processor speed pulling so far ahead of memory access. So you try to do things like accessing memory sequentially instead of randomly.

In theory, the JVM should be optimizing this type of stuff in Java without us needing to worry about it, which is one of the reasons Java could theoretically be faster than C; but in practice, it’s not clear how much this is happening.

So here’s the question: does anyone know if there is anything to be gained from cache-friendly methods in Java, and how best to exploit this, or does the work that the JVM does render these techniques irrelevant? In C++ some people see up to 10x performance increases by appropriately handling memory access, so it could be very worthwhile to figure this out. In some cases I’ve been able to exhibit speed gains by interleaving arrays of built-in types in very careful ways, so I know the JVM isn’t doing everything that’s possible, but does anyone know any best practices that are generally useful? I’d hate to be optimizing for a particular architecture and JVM…
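(For concreteness, a rough sketch of the interleaving idea - the particle update is just a stand-in and the numbers are arbitrary:)


int n = 20000;

// Two parallel arrays: x[i] and vx[i] can end up far apart in memory.
float[] x = new float[n];
float[] vx = new float[n];

// Interleaved: position and velocity of particle i sit next to each other,
// so one cache line load brings in both values the update needs.
float[] xvx = new float[2 * n];
for (int i = 0; i < n; i++) {
	xvx[2 * i] += 0.01f * xvx[2 * i + 1]; // position += dt * velocity
	xvx[2 * i + 1] *= 0.99f;              // damp velocity
}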

I am very interested in this topic as well. I’m mainly a C++ programmer these days and what you say is very true. I’ve never understood how these cache locality issues translate to Java. Is there any kind of documentation out there on this? Would the JVM spec indicate anything in this regard? Basically the only strategy that I’m sure works is to keep everything as small and as contiguous as possible (e.g. a single FloatBuffer of floats that is used as an n-dimensional array rather than an actual n-dimensional array).
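(To illustrate the FloatBuffer idea, a minimal sketch of a flattened “2D array” with the index math done by hand - the Grid class is just my own name for illustration:)


import java.nio.FloatBuffer;

// One contiguous buffer standing in for a float[rows][cols].
class Grid {
	final int cols;
	final FloatBuffer data;

	Grid(int rows, int cols) {
		this.cols = cols;
		this.data = FloatBuffer.allocate(rows * cols);
	}

	// Element (r, c) lives at offset r * cols + c.
	float get(int r, int c) { return data.get(r * cols + c); }
	void set(int r, int c, float v) { data.put(r * cols + c, v); }
}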

My impression is that the JVM specs guarantee literally nothing about any sort of contiguous allocation - by my reading, an array of floats could very well be scattered all across the heap. Apparently even the stack is usually not contiguous in Sun’s VMs.

I would think, though, that the JVM would take the fact that you’re allocating a bunch of floats in an array as a good hint that it’s worth putting them near each other in memory. And I have, indeed, been able to show small improvements in performance by carefully interleaving array data from several arrays into one. What I’m not sure about is a) whether this is consistent and exploitable in general, and b) what happens with arrays of objects, and whether there is even any point in trying to do something similar - for instance, will the array of references ever be allocated so that it is interleaved with the referents for better locality?

I’m hoping that someone knows how HotSpot actually handles this stuff in practice, since in reality, we’re normally programming to that implementation, and some rules of thumb would be great. And maybe someone knows whether this stuff changes every version or is stable enough to try to work out.

BTW, re: FloatBuffers, are those faster than a single dimensional float[]? I couldn’t exhibit a simple case where that seemed to be the case. Any chance you’ve got a snippet that could demonstrate a good use?

This is logical, but in the case of arrays you must access them sequentially and “incrementally” (from offset 0 to offset max), and they seem to give better results when you work in 64 KB blocks (see the sketch below). Also, use only one-dimensional arrays, as multidimensional arrays work differently than in C (multiple blocks of memory are allocated rather than a single one).
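(A sketch of what I mean by working in blocks - the 16K floats = 64 KB figure and the two passes are just stand-ins:)


// Process a big array in cache-sized chunks instead of making several
// full passes over the whole thing; each chunk stays hot in the cache
// across both steps.
static final int BLOCK = 16 * 1024; // 16K floats = 64 KB

static void process(float[] a) {
	for (int start = 0; start < a.length; start += BLOCK) {
		int end = Math.min(start + BLOCK, a.length);
		for (int i = start; i < end; i++) a[i] *= 0.5f; // step 1
		for (int i = start; i < end; i++) a[i] += 1f;   // step 2
	}
}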

Avoid casts when they are not simple:
double => int == very slow
int => double == acceptable
Avoid casting to a lower precision, and from floating point to fixed point.

For complex computations, do not work directly on the array values; copy them into locals, do the work there, and put them back:

int x = array[i];   // one read
x += y;             // perform the complex ops on the local
// ...
array[i] = x;       // one write

When you process array values you should have only one read and one write per value inside a loop; arrays are slow because there is a bounds check on every read/write operation.

When possible, work on one block of memory at a time: working through one array and then through another seems to be faster than working on both at the same time (especially when there are a lot of arrays). This is also true for objects; I guess it is related to the cache.

Note that the above is only from my own experience.

EDIT:

[quote]BTW, re: FloatBuffers, are those faster than a single dimensional float[]? I couldn’t exhibit a simple case where that seemed to be the case. Any chance you’ve got a snippet that could demonstrate a good use?
[/quote]
Nothing can be faster than a simple float[] array in pure Java. FloatBuffers may be faster for exchanging data with external/native functions, or may have some special functions implemented; see the sketch below.
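(For the native-interop case, the usual pattern is a direct buffer in native byte order - roughly like this, with the sizes picked arbitrarily:)


import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.FloatBuffer;

// A direct buffer lives outside the Java heap, so native code (OpenGL,
// OpenAL, ...) can read it in place; a plain float[] would have to be
// copied (or pinned) on every native call.
FloatBuffer vertices = ByteBuffer
		.allocateDirect(1000 * 3 * 4) // 1000 xyz vertices, 4 bytes per float
		.order(ByteOrder.nativeOrder())
		.asFloatBuffer();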

See 4820062 :slight_smile:

Cas :slight_smile:

The solution with “@Struct” would be absolutely ugly.

Reading the comments, I get the feeling that people actually don’t want games to be written in Java. Conspiracy of the C++ programmers? The JVM is written by C++ programmers, isn’t it? :persecutioncomplex:

I think they are simply trying to avoid adding a new keyword to the language. I remember when they introduced ‘enum’ it broke several projects, including Apache projects and even NetBeans. The annotation-based solution is probably the only feasible way to get it into a 15-year-old language like Java.

Nah, that’s just a lame excuse. After all binary compatibility wouldn’t be broken, only source, and how hard is it to do a trivial renaming refactoring on the very rare bits of code with the keyword “struct” in?

Cas :slight_smile:

It’s actually no work at all to refactor it… provided you have the source and are allowed to recompile it. They don’t see it as a main feature, which is why they want to avoid causing any trouble with it.
(But to be honest, I personally really don’t care whether it is called ‘struct’ or ‘@Struct’.)

IMO annotations are a pretty cool workaround to introduce language features, you can even make java “feel” like Erlang if you want.


http://www.malhar.net/sriram/kilim/

but that’s completely off topic :wink:

(Edit: posted some timing results, but just realized they might be flawed due to a C++ timing bug, will fix up and repost)

Can you also provide results with the client compiler? (At least I think you tested only server.)

Or is the server compiler usable on a multicore machine without the initial low performance and very visible stuttering due to compilation?

Soon to be rendered irrelevant with the combined client+server JVM, so let’s not worry about the difference any more :slight_smile:

Cas :slight_smile:

Ok, verified the results again, so here’s the post I deleted:

Some initial experimentation shows some interesting things. I ran a simple 1D particle simulation (increment position based on velocity, apply a tiny bit of damping to velocity) in both Java and C++ comparing the behavior depending on whether I had two separate arrays for velocity and position or one interleaved array. I can post the code if anyone cares (and yup, I am very aware of the pitfalls when writing a Java microbenchmark!), but for now I won’t clutter things up. Here are the results, in FPS, higher is better:

(20000 particles over 10000 frames of simulation, 40 tests run and averaged)

C++ Results:
Normal floats: 5950.77
Interleaved floats: 30524.33

Java Results (Apple JVM 1.6, -server - I’ll check this on a Sun VM pretty soon, too):
Normal floats: 22105.749484383392
Interleaved floats: 22242.982339072023

So by a MASSIVE amount, C++ with separate arrays performs the worst, and we see a 5:1 improvement there by going to interleaved arrays. But in Java, there’s essentially no difference, at least at this particle count - trust me, the sub percentage point difference you see there is essentially just noise.

The real surprising thing (to me, at least), is that at least Apple’s JVM actually does appear to be making good on the claim that it can achieve better cache locality than AOT compiled languages, as the non cache-friendly Java code performs four times as well as the corresponding C++.

I wondered whether the performance difference between Java and C++ could be accounted for by the bounds check, so I inserted a basic bounds check into the C++ (just breaks out if bounds check fails, which is probably not quite as complex or expensive as Java’s):

C++ (with bounds check):
Normal floats: 5956.94
Interleaved floats: 25678.14

No difference at all for the regular float version (two arrays), but brings the interleaved version within spitting distance of the Java one.

Somewhat relevant to Cas’ suggestion, I also tried using an array of Particle objects in Java (just holders for x and vx, in this case):

Java using Particle class: 17139.75

Using an array of objects is still better than the worst C++, but a ways behind optimized Java; and we haven’t even touched the issue of creating objects on the fly or garbage collecting them, either, which are the real slowdowns that I tend to see in my code.

So there’s no real conclusion here, except that short of getting rid of those bounds checks, pure float arrays in Java are handled about as well as can be expected, and the JVM actually is doing quite a bit of work to keep the cache miss rate low.

This is now totally off topic, but I ran into one thing that I should warn you about - the initial version of the particle update code I wrote read something like this:


for (int i=0; i<particleList.length; ++i) {
	particleList[i].x += .01f * particleList[i].vx;
	particleList[i].vx *= 0.99f;
}

…and I was surprised to find that my timing information was heavily dependent upon the number of frames I simulated. But FPS should not depend on the number of frames you run, as long as it’s large, yet I hit a cutoff around 9000 frames where pushing any higher meant an 80% drop in frame rate! (Interestingly enough, the interleaved method weathered this drop far better than the others in Java.)

The cause: it turns out that numerical underflow will freaking MURDER your program’s performance in either C++ or Java - that vx *= 0.99f line was bringing all the velocities down to the level of underflow after about 9000 frames, and from that point on it’s a major drag on performance. In this case I wasn’t interested in working around that issue, so I just changed the decay factor to 0.999f (putting off the inevitable a little further), but be aware of this in your own code: if you have a large number of variables that are used a lot and could exponentially decay to zero, it might be worth checking them against a threshold.
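(A minimal sketch of that threshold check - the 1e-35f cutoff is an arbitrary choice of mine; anything comfortably above Float.MIN_NORMAL will do:)


// Flush tiny velocities to exactly zero before they reach the subnormal
// (denormalized) float range, which many CPUs handle very slowly.
static final float EPSILON = 1e-35f;

static void step(float[] x, float[] vx) {
	for (int i = 0; i < x.length; i++) {
		x[i] += 0.01f * vx[i];
		vx[i] *= 0.99f;
		if (vx[i] > -EPSILON && vx[i] < EPSILON) vx[i] = 0f; // zero is fast
	}
}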

Things I’ve learned:

  • The overall conclusion: the JVM does exactly what it promises to, at least with floating point arrays, optimizing them in almost every case to the point that they are similar in performance to well-used C++ arrays.
  • I could not find a SINGLE way to affect the timing via a local (non-algorithmic) optimization - storing the loop limit away, changing to a while loop, replacing with a try/catch and a while(true), etc. - every one of these things performed EXACTLY the same
  • final keyword makes no difference at all to performance anymore, anywhere, at any time. Only use it when it “should” be used, not because you think it will make things faster.
  • Scope does not affect speed within a class (passing an array as a parameter vs. storing it as a member doesn’t make a difference)
  • Visibility of array does not affect speed
  • Using getters and setters to marshal access to private object members does not degrade performance
  • for (Particle p:particles) loop performed exactly the same
  • Any optimizations that are happening for fast array access do not seem to happen at garbage collection time - performance is identical before a single GC happens
  • Memory layout management is probably NOT responsible for the fast access - disrupting the access patterns does not seem to make any difference, so the JVM must be doing some intelligent prefetching

@jezek2: on the Mac, the -server mode in 1.6 doesn’t do very much, I think it just changes the way the heap sizing operates, which is irrelevant for these tests. I tried in 1.5, though, and got the following:

Apple 1.5 without -server:
Normal FPS: 7530.917239361543
Interleaved FPS: 7536.807696165959
Object FPS: 6943.34652777474

Apple 1.5 with -server:
Normal FPS: 22029.56632155148
Interleaved FPS: 21603.01266973487
Object FPS: 18347.889937613505
Local FPS: 13744.841045723999

So at least on OS X, 1.5 should DEFINITELY be used with the -server flag. I included that “Local FPS” entry in the last one because it shows that at least Apple’s 1.5 -server has a speed problem when arrays are passed as parameters, I’m not sure why - I omitted that from all the other tests because it never differed.

I know that most people aren’t using Apple’s VM, so I’m going to boot into Windows right now and see how things work there…I’ll post some more results in half an hour or so.

And btw, I’ve never had problems with stuttering or anything like that due to the server flag, I tend to find that compilations pretty much cease after the first several seconds of gameplay, which you can keep an eye on with the -XX:+PrintCompilation flag - have you had trouble with this in the past?

Let me start by saying I read your whole post.

[quote] for (Particle p:particles) loop performed exactly the same
[/quote]
That enhanced for-loop creates an Iterator; the overhead can be significant in nested loops.
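For reference, a sketch of what the compiler generates in each case (particleArray and particleCollection are hypothetical names): over a Collection the enhanced for allocates an Iterator per loop entry, while over a plain array it becomes an ordinary indexed loop:


// for (Particle p : particleArray) over an array desugars to:
for (int i = 0; i < particleArray.length; i++) {
	Particle p = particleArray[i];
	// ...
}

// for (Particle p : particleCollection) over a Collection desugars to:
for (java.util.Iterator<Particle> it = particleCollection.iterator(); it.hasNext(); ) {
	Particle p = it.next();
	// ... the per-entry Iterator allocation is the nested-loop overhead
}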

Further, I agree with most of your findings.
I recently began to change my coding where performance really mattered:
With out-of-order CPUs, you can actually do quite a bit while the CPU is waiting for the cache, or worse: for system RAM.

Let’s look at this loop:


long[] data = new long[1024];
long xor = 0;
for(int i=0; i<data.length; i++)
   xor ^= data[i];

It’s obvious that this loop basically waits a lot of the time. You can do quite a bit meanwhile, in the same thread.

It’s actually so bad that you can do this in exactly the same time:


long[] data = new long[1024];
long xor = 0;
for(int i=0; i<data.length; i++)
   xor ^= swap64(data[i]);

// Reverses the byte order of a 64-bit value: pure register bit-twiddling,
// no memory access.
public static final long swap64(long val)
{
   long swapped = 0;
   swapped |= (val & 0x00000000000000FFL) << 56;
   swapped |= (val & 0x000000000000FF00L) << 40;
   swapped |= (val & 0x0000000000FF0000L) << 24;
   swapped |= (val & 0x00000000FF000000L) << 8;
   swapped |= (val & 0x000000FF00000000L) >> 8;
   swapped |= (val & 0x0000FF0000000000L) >> 24;
   swapped |= (val & 0x00FF000000000000L) >> 40;
   swapped |= (val & 0xFF00000000000000L) >>> 56;
   return swapped;
}

We’re adding maybe 32 instructions per loop-iteration, and we have no speed-penalty at all.

So with this knowledge, I began to interleave my algorithms (very messy), which yielded great (!) performance improvements.
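To give a flavor of that interleaving, a contrived sketch (assuming two long[] arrays a and b of equal length, and the swap64 above) - fusing two independent passes so the register-only bit-twiddling of one overlaps the memory stalls of the other:


// Alternative 1 - two separate passes: in each one the ALU mostly
// idles while waiting on loads.
long xorA = 0, sumB = 0;
for (int i = 0; i < a.length; i++) xorA ^= a[i];
for (int i = 0; i < b.length; i++) sumB += swap64(b[i]);

// Alternative 2 - fused: while the load of a[i] is in flight, the
// out-of-order core can chew through the swap64 bit-twiddling for b[i],
// so the combined loop costs little more than either pass alone.
for (int i = 0; i < a.length; i++) {
	xorA ^= a[i];
	sumB += swap64(b[i]);
}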

/**
 * @(#)Bench.java
 *
 * Simple bench showing the difference between array access patterns
 *
 * @version 1.00 08/06/23
 */
 
import java.awt.*;
import java.applet.*;

public class Bench extends Applet {
	
	private long getTime()
	{
		return System.currentTimeMillis();
	}
	
	
	public void init()
	{
		this.testMultiArray();
		this.testSingleArray();
		//Another try
		System.out.println("Another try");
		this.testSingleArray();
		this.testMultiArray();

	}

	private void testSingleArray()
	{
		System.out.println("Single array test");	
		float a1[]=new float[1024*20*1024];
		int nb=0;
		long s=getTime();
		for(int y=0;y<1;y++)
		{
			for(int x=0;x<1024*1024*20;x++)
			{
				a1[x]+=1f;
				nb++;
			}			
		}
		long e=getTime();
		long time=(e-s);
		System.out.println(time+"ms for "+nb+" loops");		
		
	}
		
	private void testMultiArray()
	{
		System.out.println("Multi array test");	
		float a1[][]=new float[1024*20][1024];
		int nb=0;
		long s=getTime();
		for(int y=0;y<1024;y++)
		{
			for(int x=0;x<1024*20;x++)
			{
				a1[x][y]+=1f;
				nb++;
			}			
		}
		long e=getTime();
		long time=(e-s);
		System.out.println(time+"ms for "+nb+" loops");		
		
	}

	public void paint(Graphics g) 
	{
	
	}
}

Results:

Multi array test
3859ms for 20971520 loops
Single array test
218ms for 20971520 loops
Another try
Single array test
234ms for 20971520 loops
Multi array test
3875ms for 20971520 loops

Well, that will teach me to make the assumption that Apple’s VM has anything in common with Sun’s! Here are my results using 6u10 in XP on the same machine:

(still 20000 particles, now over 1000 frames so I don’t drive myself nuts waiting, so bear in mind these results are slightly more variable)
With -server:
Normal FPS: 14993.227896306866
Interleaved FPS: 20555.42182483608
Object FPS: 14358.932379346927
Local FPS (interleaved): 20807.751451388936

Without -server:
Normal FPS: 8315.078683988102
Interleaved FPS: 7129.550702202638
Object FPS: 11372.472596502801
Local FPS (interleaved): 7685.328248452919

So with Sun’s VM, -server is DEFINITELY worth it. Interestingly enough, the performance characteristics are almost exactly inverted between the client and server VMs. Here there does appear to be a difference between using a single interleaved array and two separate ones.

Also, it appears that Apple’s VM is a bit faster at this than Sun’s as of 1.6, delivering more even performance overall.

Since I just noticed that this forum scrolls code posted rather than inlining it, I might as well post mine in case anyone has any beef with it:


public class CacheBenchmark {
	private float[] floats;   //interleaved
	private float[] floatsx;  //normal position
	private float[] floatsvx; //normal velocity
	private Particle[] particleList; //particle holder
	
	private boolean firstTime = true;
	
	private static String dumpString = ""; 

	public static void main(String[] args) {
		CacheBenchmark myBench = new CacheBenchmark();
		
		int warmupiters = 100;
		int iters = 100;
		int frames = 1000;
		int particles = 20000;
		
		double interleavedDiff = 0.0;
		double normalDiff = 0.0;
		double objectDiff = 0.0;
		double localDiff = 0.0;
        
		long nanos = System.nanoTime();
        long diff = 0;
        System.out.println("Warming up...");
        for (int i=0; i<iters+warmupiters; i++){
        	System.gc();
        	
        	myBench.setupTest(particles);
        	
        	nanos = System.nanoTime();
        	for (int j=0; j<frames; ++j) {
        			myBench.stepFloatsInterleaved();
        	}
        	diff = System.nanoTime() - nanos;
            myBench.dumpResults();
            double fps = frames / ( diff / ((double)1000000000) );
            System.out.println("Interleaved:\t "+fps+",  ");
            if (i >= warmupiters) interleavedDiff += diff; // >= so exactly 'iters' samples are summed
            
            /////////////////////////
            nanos = System.nanoTime();
        	for (int j=0; j<frames; ++j) {
        			myBench.stepFloatsNormal();
        	}
            diff = System.nanoTime() - nanos;
            myBench.dumpResults();
            fps = frames / ( diff / ((double)1000000000) );
            System.out.println("Normal:\t\t "+fps+",  ");
            if (i >= warmupiters) normalDiff += diff;

            /////////////////////////
        	
            nanos = System.nanoTime();
        	for (int j=0; j<frames; ++j) {
        			myBench.stepParticles();
        	}
            diff = System.nanoTime() - nanos;
            myBench.dumpResults();

            fps = frames / ( diff / ((double)1000000000) );
            System.out.println("Particles:\t "+fps+",  ");
            if (i >= warmupiters) objectDiff += diff;
            
            /////////////////////////
            
            float[] localFloats = new float[2*particles];
            for (int l=0; l<particles; ++l) {
    			float x = l*l*.01f;
    			float vx = l*.1f;
    			localFloats[2*l] = x;    // match setupTest: even slots = position
    			localFloats[2*l+1] = vx; // odd slots = velocity
    		}
            
            nanos = System.nanoTime();
        	for (int j=0; j<frames; ++j) {
        			myBench.stepParameterInterleaved(localFloats);
        	}
            diff = System.nanoTime() - nanos;
            myBench.dumpResults();

            fps = frames / ( diff / ((double)1000000000) );
            System.out.println("Local int.:\t "+fps+",  ");
            if (i >= warmupiters) localDiff += diff;
            
            

        }
        
        normalDiff /= iters;
        interleavedDiff /= iters;
        objectDiff /= iters;
        localDiff /= iters;
        
        double normalFPS = frames / ( normalDiff / ((double)1000000000));
        double interleavedFPS = frames / ( interleavedDiff / ((double)1000000000));
        double objectFPS = frames / ( objectDiff / ((double)1000000000));
        double localFPS = frames / ( localDiff / ((double)1000000000));
        System.out.println("Normal time elapsed:\t\t"+normalDiff);
        System.out.println("Interleaved time elapsed:\t"+interleavedDiff);
        System.out.println("Object time elapsed:\t\t"+objectDiff);
        System.out.println("Local time elapsed:\t\t"+localDiff);
        System.out.println("Normal FPS:\t\t"+normalFPS);
        System.out.println("Interleaved FPS:\t"+interleavedFPS);
        System.out.println("Object FPS:\t\t"+objectFPS);
        System.out.println("Local FPS:\t\t"+localFPS);
        
        
        System.out.println("\n\nDumping some crap, please ignore:");
        printDump();
	}
	
	public void setupTest(int particles) {
		if (firstTime) {
			floats = new float[2*particles];
			floatsx = new float[particles];
			floatsvx = new float[particles];
			particleList = new Particle[particles];
		}
		
		for (int i=0; i<particles; ++i) {
			float x = i*i*.01f;
			float vx = i*.1f;
			floats[2*i+1] = vx;
			floats[2*i] = x;
			floatsx[i] = x;
			floatsvx[i] = vx;
			if (firstTime) particleList[i] = new Particle();
			particleList[i].setX(x);
			particleList[i].setVx(vx);
		}
		
		firstTime = false;
	}
	
	private void stepFloatsInterleaved() {
		for (int i=0; i<floats.length/2; ++i) {
			int x = 2*i;
			int vx = 2*i+1;
			floats[x] = floats[x] + .01f * floats[vx];
			floats[vx] *= 0.999f;
		}
	}
	
	private void stepFloatsNormal() {
		for (int i=0; i<floatsx.length; ++i) {
			floatsx[i] = floatsx[i] + .01f * floatsvx[i];
			floatsvx[i] *= 0.999f;
		}
	}
	
	private void stepParticles() {
		for (int i=0; i<particleList.length; ++i) {
			particleList[i].setX(particleList[i].getX() + .01f * particleList[i].getVx());
			particleList[i].setVx(particleList[i].getVx() * 0.999f);
		}
	}
	
	private void stepParameterInterleaved(float[] floatArray) {
		for (int i=0; i<floatArray.length/2; ++i) {
			int x = 2*i;
			int vx = 2*i+1;
			floatArray[x] = floatArray[x] + .01f * floatArray[vx];
			floatArray[vx] *= 0.999f;
		}
	}
	
	
	public void dumpResults() {
		for (int i=0; i<floatsx.length; ++i) {
			if (Math.random() < 0.0001) dumpString += floats[2*i]+floats[2*i+1]+floatsx[i]+floatsvx[i]+particleList[i].getX()+particleList[i].getVx();
		}
	}
	
	public static void printDump() {
		System.out.println(dumpString);
	}

}

class Particle{
	private float x;
	private float vx;
	public void setX(float x) {
		this.x = x;
	}
	public float getX() {
		return x;
	}
	public void setVx(float vx) {
		this.vx = vx;
	}
	public float getVx() {
		return vx;
	}
}

@DzzD: There’s one issue with that benchmark, which brings up a very interesting point (maybe that was the thing you were demonstrating?) - you’re running through the multi-array loop in the “wrong” order; if you switch the inner and outer loops in the testMultiArray function, I get the following results:

Multi array test
62ms for 20971520 loops
Single array test
47ms for 20971520 loops
Another try
Single array test
47ms for 20971520 loops
Multi array test
47ms for 20971520 loops

Of course, this difference itself is a cache effect, and shows that the JVM is not handling multidimensional array optimizations very well, nor is it reordering accesses when possible. Try the following altered code and play with xWidth and yWidth to alter the dimensions of the array and see the difference between accessing it the right way and the wrong way, sometimes it is extremely dramatic:


/**
 * @(#)Bench.java
 *
 * Simple bench showing the difference between array access patterns
 *
 * @version 2.00 08/06/23
 */
 
import java.awt.*;
import java.applet.*;

public class Bench extends Applet {
	
	private final int yWidth = 20; 
	private final int xWidth = 500*1024;
	
	private long getTime()
	{
		return System.currentTimeMillis();
	}
	
	
	public void init()
	{
		this.testMultiArray();
		this.testMultiArray2();
		//Another try
		System.out.println("Another try");
		this.testMultiArray2();
		this.testMultiArray();

	}
		
	private void testMultiArray()
	{
		System.out.println("Badly traversed multi array test");	
		float a1[][]=new float[xWidth][yWidth];
		int nb=0;
		long s=getTime();
		for(int y=0;y<yWidth;y++)
		//for (int x=0; x<1024*20;++x)
		{
			for(int x=0;x<xWidth;x++)
			//for(int y=0;y<1024;++y)
			{
				a1[x][y]+=1f;
				nb++;
			}			
		}
		long e=getTime();
		long time=(e-s);
		System.out.println(time+"ms for "+nb+" loops");		
		
	}
	
	private void testMultiArray2()
	{
		System.out.println("Well traversed multi array test");	
		float a1[][]=new float[xWidth][yWidth];
		int nb=0;
		long s=getTime();
		//for(int y=0;y<2;y++)
		for (int x=0; x<xWidth;++x)
		{
			//for(int x=0;x<1024*20;x++)
			for(int y=0;y<yWidth;++y)
			{
				a1[x][y]+=1f;
				nb++;
			}			
		}
		long e=getTime();
		long time=(e-s);
		System.out.println(time+"ms for "+nb+" loops");		
		
	}

	public void paint(Graphics g) 
	{
	
	}
}


@Riven: that’s extremely interesting, I’ll have to play around with that type of effect. Do you have any advice as to when you can generally get away with stuff like that?

You can only do calculations that can operate fully on the registers, ‘while’ waiting for memory I/O.

IIRC, a memory lookup can take up to 40 cycles (depending on memory speed and CPU architecture) when it’s not in cache. Even cache is slow compared to registers. This is the sole reason why programming in ASM can give such great performance: manually fiddling with registers in a way that compilers simply can’t get anywhere near. In Java we can only put as much as possible (primitives, not arrays, which involve memory I/O) in local variables and hope for the best.
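A sketch of that local-variable idiom (borrowing the field names from the benchmark above):


// Hoist hot fields into locals so the JIT can keep them in registers
// instead of reloading them from memory inside the loop.
void integrate() {
	float[] xs = this.floatsx;
	float[] vs = this.floatsvx;
	int n = xs.length;
	for (int i = 0; i < n; i++) {
		float v = vs[i] * 0.999f; // work on a local (register)
		xs[i] += 0.01f * v;
		vs[i] = v;                // one write-back per element
	}
}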

I read the source code of your benchmark, and it’s rather poor, unfortunately.

[quote]for (int i=0; i<floatArray.length/2; ++i)
[/quote]
Have you got any idea how long a division takes? It might be like 25-50% of all the time spent in that loop.

change it to:

[quote]int end = floatArray.length/2;
for (int i=0; i<end; ++i)
[/quote]
Also, x*2 is much slower than x<<1.

Further, changing the order of the tests in the benchmark will probably have a significant effect on the outcome.

I understand the | ^ & part related to ASM, but I have trouble understanding where the stall happens. I mean, for the value to be passed to the function it must first be read, and then the function is called, no?

Anyway, it seems very interesting; can you please provide a different sample for better understanding?

[quote=“Riven,post:18,topic:31985”]I read the source code of your benchmark, and it’s rather poor, unfortunately.
[/quote]
Ouch. But I took another look, and that’s actually a fair assessment, though only for one of the reasons you posted - the ordering criticism is valid and I definitely should have known better than to ignore it (which does effectively render the whole test moot, so you’re ultimately correct). But for the sake of nitpicking, I have to disagree with the other stuff. :wink:

(re: for (int i=0; i<floatArray.length/2; ++i) )

This division, expensive as it might be, only happens once, not every time through the loop, since the JVM can figure out that it’s constant. It might even be converted to a shift. This is actually the first thing I tested (I trimmed all that code out for simplicity’s sake), since these types of optimizations are often the lowest hanging fruit in C. Fortunately (or not, if you’re looking for places to optimize!) inefficiencies in loop handling code seem to be completely optimized away by the JVM, as I mentioned before. You can even do all sorts of nasty stuff like calling simple functions like getters to get the upper range for the loop, and for the most part it adds no overhead as long as the limit is actually constant (there’s probably some trickery you could pull to make the JVM worry that it won’t be, in which case you would be right, but this is not one of those cases).

I tried anyways, though - replacing my original code with yours resulted in no significant difference (I removed all but the interleaved test and ran it separate times to hopefully avoid ordering issues, and got roughly the same results on each run):

Original code: 20565.894852276786 fps
After replacements: 20468.498175720106 fps

Again, not on today’s VMs, especially if the 2 is written as a compile time constant. This is one of the easiest optimizations for a compiler to make, so I would be shocked if the JVM wasn’t doing it. If I’m correct it’s even done in most C/C++ compilers these days, though I haven’t checked that.

(BTW, I’m only making a claim about desktop Java on a recent JIT - on embedded, all this stuff is vital to getting passable performance, and it’s probably increasingly important as you move back from 1.6 even on the desktop)

I threw together a quick benchmark to verify all this. I think I worked around the ordering issue by picking from the two variants (which should be identical apart from your suggested changes, and perform an array copy with a transformation and even/odd flop so as to try and confound any unexpected cleverness from the JVM) randomly, but let me know if you think there’s still likely a bias here:


import java.util.ArrayList;

public class LoopBench {
	private final int length = 100000; 
	private int[] source;
	private int[] dest;
	
	// To store away results so that no dead-code can be eliminated
	private ArrayList<Integer> holder = new ArrayList<Integer>();
	
	public static void main(String[] args) {
		LoopBench myBench = new LoopBench();
		
		int warmupiters = 1000;
		int iters = 10000;
		
		ArrayList<Long> unoptimizedTimes = new ArrayList<Long>();
		ArrayList<Long> optimizedTimes = new ArrayList<Long>();
		
		long nanos = System.nanoTime();
        long diff = 0;
        
        for (int i=0; i<iters+warmupiters; i++){
        	myBench.initialize();
        	
        	boolean doOptimized = false;
        	if (Math.random() >= 0.5) doOptimized = true;
        	
        	if (doOptimized) {
        		nanos = System.nanoTime();
        		myBench.loopOptimized();
        		diff = System.nanoTime() - nanos;
        	} else {
        		nanos = System.nanoTime();
        		myBench.loopUnoptimized();
        		diff = System.nanoTime() - nanos;
        	}
        	myBench.dump();
        	if (i < warmupiters) {
        		System.out.println("Warmup iteration "+i);
        		continue;
        	}
        	
        	if(doOptimized) {
        		optimizedTimes.add(diff);
        		System.out.println("Optimized:\t"+diff);
        	} else {
        		unoptimizedTimes.add(diff);
        		System.out.println("Unoptimized:\t"+diff);
        	}
        }
        
        myBench.dumpHolder();
		
        double optSum = 0.0;
        for (Long l:optimizedTimes) {
        	optSum += l.doubleValue();
        }
        optSum /= optimizedTimes.size();
        
        double unoptSum = 0.0;
        for (Long l:unoptimizedTimes) {
        	unoptSum += l.doubleValue();
        }
        unoptSum /= unoptimizedTimes.size();
        System.out.println("Optimized average: "+optSum + " over "+optimizedTimes.size()+" tests");
        System.out.println("Unoptimized average: "+unoptSum + " over "+unoptimizedTimes.size()+" tests");
        
	}
	
	private void initialize() {
		source = new int[2*length];
		dest = new int[2*length];
		for (int i=0; i<source.length; ++i) {
			source[i] = (int)(1000*Math.random());
		}
	}
	
	// Use all the data, somehow
	private void dump() {
		int x = 0;
		for (int i=0; i<source.length; ++i) {
			x += source[i] - dest[i];
		}
		holder.add(x);
	}
	
	// Dump the useless data to the console
	private void dumpHolder() {
		System.out.println("Useless data:");
		long sum = 0;
		for (Integer i:holder) {
			sum += i.intValue();
		}
		System.out.println(sum);
		System.out.println("End useless data.");
	}
	
	private void loopUnoptimized() {
		for (int i=0; i<getLength()/2; ++i ) {
			dest[2*i] = someFunctionUnoptimized(source[2*i+1]);
			dest[2*i+1] = someFunctionUnoptimized(source[2*i]);			
		}
	}
	
	private void loopOptimized() {
		int limit = source.length >> 1;
		for (int i=0; i<limit; ++i ) {
			dest[(i<<1)] = someFunctionOptimized(source[(i<<1)+1]);
			dest[(i<<1) + 1] = someFunctionOptimized(source[(i<<1)]);			
		}	
	}
	private int getLength() {
		return source.length;
	}
	
	private int someFunctionUnoptimized(int x) {
		return (2*x + 1);
	}
	
	private int someFunctionOptimized(int x) {
		return ((x<<1) + 1);
	}
}


My results (time taken, not fps):
Optimized average: 338101.0054556476 over 4949 tests
Unoptimized average: 339458.2603444862 over 5051 tests

The difference between the results varies from run to run and with the size of the array, sometimes putting the optimized version ahead, sometimes the unoptimized one; in any case, I’m not convinced by these results that there’s anything going on here.

Feel free to prove me wrong, though, I’m just going off of what I’ve noticed myself, and you definitely know a lot more about the internals than I do.

There you’re 100% right, and I’m a damn fool since I know that’s always an issue. I thought I had checked for this but I just reordered things manually and the results come out different on Windows (haven’t checked OS X). My bad.

The earlier conclusion (from the OS X tests I ran, not the Windows ones) is actually amplified by this mistake, so I’ll restate it: there seems to be no predictable, exploitable difference between the performance of a single interleaved array and two separate arrays, because performance depends on a lot of other factors that you will likely have little control over in a live app. And no matter what, as long as you use the server VM, performance beats cache-unfriendly C++ code, so the JVM is doing a lot of lifting even for simple loops.

Which brings me to the next obvious question: is there any logic to how test ordering affects timing results? Since this seems to affect things up to 50% or more, it would be fantastic to have a better idea how to avoid those hits against speed whenever possible, but it initially seems to be totally random - surely there’s some method to the madness?