Slow array filling

Hi all,

please have a look at this great piece of code:


public class ArrayTest {

   private static int length=196608;
   private static int[] pixelArray=new int[length];
   private static int[] zbufferArray=new int[length];
   private static int color=0;

   public static void main(String[] args) {
      do {
         long start=System.currentTimeMillis();
         for (int z=0; z<100; z++) {
           clearArray(color);
         }
         System.out.println(System.currentTimeMillis()-start);
      } while (true);
   }

   private static void clearArray(int color) {
      // Variant A (commented out): fill both arrays in a single loop.
      /*
      for (int i=0; i<length; i++) {
         pixelArray[i]=color;
         zbufferArray[i]=-2147483647;
      }
      */

      // Variant B (active): fill each array in its own loop.
      for (int i=0; i<length; i++) {
         pixelArray[i]=color;
      }

      for (int i=0; i<length; i++) {
         zbufferArray[i]=-2147483647;
      }
   }
}

When running this on a P4HT@3.2GHz using the 1.4.2 VM in client mode, each pass of the outer loop (100 calls to clearArray) takes around 180ms. When using the commented-out array filling instead (the one that fills both arrays in one loop), I’m at 450ms. But it gets even stranger: this test is not a real-world app (of course not…), but my software renderer is, and it’s basically doing the same thing. In that application, I used to use the single-loop version; it starts fast (like the 180ms version) but drops to the 450ms performance after some seconds. It doesn’t do this on my AthlonXP 2600+ machine (same OS (XP) and VM). And to complete the weirdness: on that machine, the single-loop version is faster than the split one.

To summarize:

P4HT/1.4.2/single loop: 450ms
P4HT/1.4.2/two loops: 180ms
XP2600+/1.4.2/single loop: 180ms
XP2600+/1.4.2/two loops: 230ms

I’ve one question: WHY? And why does the P4 start fast (so obviously, it can run it fast…) when I’m doing this in the actual renderer, but drop after some seconds?

BTW: -server mode doesn’t help. It’s a bit faster, but the behaviour is the same.

I tried your code with IBM JRE 1.4.0 for Windows:

                      Two loops    One loop
Pentium 4 1.5 GHz     160 ms       950 ms

Then I ported ArrayTest to C++, compiled with MS Visual C++ 6.0 (full optimization):

                      Two loops    One loop
Pentium 4 1.5 GHz     160 ms       1101 ms

It’s probably due to memory access.

                     Ciao

[quote]but it drops to the 450ms performance after some seconds
[/quote]

Hm… so you have to refill it again and again with those numbers?

If so, System.arraycopy might be worth a try. Obviously it will need more RAM (since you have everything twice), but that shouldn’t be a problem, right?

edit: heh ok… it’s slower :stuck_out_tongue:

It’s somewhere between two loops (fast) and one loop (slow).
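
For reference, the System.arraycopy idea above can be sketched roughly as follows. This is only an illustration under assumptions: the class name TemplateClear and the zTemplate array are made up for the example, and using java.util.Arrays.fill for the pixel array is an extra suggestion that nobody in the thread measured. As noted above, the arraycopy route was not the fastest option on the machine where it was tried.

```java
import java.util.Arrays;

class TemplateClear {
    static final int LENGTH = 196608;      // same size as in ArrayTest
    static final int Z_FAR = -2147483647;  // z-buffer clear value from ArrayTest

    static final int[] pixelArray = new int[LENGTH];
    static final int[] zbufferArray = new int[LENGTH];

    // Hypothetical template array: filled once, then copied on every clear.
    // This is the "everything twice" memory cost mentioned in the post.
    static final int[] zTemplate = new int[LENGTH];
    static {
        Arrays.fill(zTemplate, Z_FAR);
    }

    static void clearWithTemplate(int color) {
        // The colour may change between frames, so fill the pixels directly.
        Arrays.fill(pixelArray, color);
        // The z-buffer clear value is constant, so copy it from the template.
        System.arraycopy(zTemplate, 0, zbufferArray, 0, LENGTH);
    }

    public static void main(String[] args) {
        long start = System.currentTimeMillis();
        for (int z = 0; z < 100; z++) {
            clearWithTemplate(0);
        }
        System.out.println(System.currentTimeMillis() - start);
    }
}
```

The timing loop mirrors the one in ArrayTest so the printed number can be compared with the figures posted in this thread.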

Tried it on 1.4.1_b21 on 1 GHz P3 win2k

Two loops: 1041
One loop: 1072

Also tried with jview:

Two loops: 1042
One loop: 1232

The results are very much alike in my situation with one loop being slightly slower.

I don’t see how this could be because of memory access since there’s more memory access with 2 loops.

I have no clue why 2 loops is faster :-/ (even though on my system it doesn’t make much of a difference)

Ran the test on a P4@2.4GHz and a 1.4.1 VM and it took around 380ms for both versions. The same for the 1.3.1 VM…
So basically, I tend to say that it’s a problem with 1.4.2 on P4, but that doesn’t explain the C++ results. And it doesn’t really explain why my actual application that’s using these loops behaves slightly differently.
Right now, I offer a method that the user of the API can call to determine which way is the fastest on the current machine, but I’m quite unhappy with this. On the other hand, I don’t want to ignore the problem, because we are talking about the difference between 52 and 40fps here…
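
A calibration method of the kind described above might look something like this. It is a minimal sketch, not the renderer's actual API: the class and method names (ClearCalibration, singleLoopIsFaster) and the run counts are invented for illustration.

```java
class ClearCalibration {
    static final int LENGTH = 196608;
    static final int Z_FAR = -2147483647;

    static final int[] pixelArray = new int[LENGTH];
    static final int[] zbufferArray = new int[LENGTH];

    // Both arrays filled in one loop.
    static void clearSingleLoop(int color) {
        for (int i = 0; i < LENGTH; i++) {
            pixelArray[i] = color;
            zbufferArray[i] = Z_FAR;
        }
    }

    // Each array filled in its own loop.
    static void clearTwoLoops(int color) {
        for (int i = 0; i < LENGTH; i++) {
            pixelArray[i] = color;
        }
        for (int i = 0; i < LENGTH; i++) {
            zbufferArray[i] = Z_FAR;
        }
    }

    // Times 'runs' clears with the chosen variant, in milliseconds.
    static long time(boolean singleLoop, int runs) {
        long start = System.currentTimeMillis();
        for (int r = 0; r < runs; r++) {
            if (singleLoop) {
                clearSingleLoop(0);
            } else {
                clearTwoLoops(0);
            }
        }
        return System.currentTimeMillis() - start;
    }

    // Returns true if the single loop was at least as fast in this measurement.
    static boolean singleLoopIsFaster() {
        // Warm-up runs so both variants get a chance to be JIT-compiled first.
        time(true, 20);
        time(false, 20);
        long single = time(true, 100);
        long split = time(false, 100);
        return single <= split;
    }

    public static void main(String[] args) {
        System.out.println("single loop faster: " + singleLoopIsFaster());
    }
}
```

One caveat that follows from the posts above: in the real renderer the single loop starts fast and only slows down after some seconds, so a one-off measurement at startup may pick the wrong variant.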

This is probably due to ‘cache thrashing’.

In the single-loop case, you’re alternating between arrays, which causes cache misses: the cache has to write back the changes to the first array and then load the second. In the two-loop case, a single array stays in the cache until it’s done with, resulting in fewer accesses to system memory.

So what you’re basically seeing is the difference between direct memory access and cached memory access.
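
To make the difference in access pattern concrete, here is a tiny sketch that only records the order of writes for each variant. The class AccessOrder is made up for the example, and p and z are shorthand for pixelArray and zbufferArray. It does not model a cache at all; whether the alternating pattern really evicts cache lines depends on how the two arrays map onto the cache of a given CPU.

```java
class AccessOrder {
    // Order of writes when both arrays are filled in a single loop.
    static String singleLoop(int n) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < n; i++) {
            sb.append("p[").append(i).append("] ");
            sb.append("z[").append(i).append("] ");
        }
        return sb.toString().trim();
    }

    // Order of writes when each array is filled in its own loop.
    static String twoLoops(int n) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < n; i++) {
            sb.append("p[").append(i).append("] ");
        }
        for (int i = 0; i < n; i++) {
            sb.append("z[").append(i).append("] ");
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        System.out.println("single loop: " + singleLoop(3)); // p[0] z[0] p[1] z[1] p[2] z[2]
        System.out.println("two loops:   " + twoLoops(3));   // p[0] p[1] p[2] z[0] z[1] z[2]
    }
}
```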

It really seems to be a problem with memory access… with alignment, to be exact.
Adding this line

private static int[] dummy=new int[2];

between the pixel and the zbuffer array improves performance from 450ms to 200ms. That’s fine for this test, but I can’t do that for the application, because the pixel array is part of a BufferedImage while the zbuffer isn’t. This sux somehow… >:(
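
One experiment for the case where the pixel array cannot be touched: over-allocate the z-buffer and shift its logical start by a few elements. This is a sketch only, with loud assumptions. The class name PaddedZBuffer and the PAD value are hypothetical; Java gives no control over where arrays are placed in memory, so this assumes that the relative placement of the two arrays is what matters and that shifting the z-buffer indices has a similar effect to the dummy array. Whether it helps at all would have to be measured.

```java
class PaddedZBuffer {
    static final int LENGTH = 196608;
    static final int Z_FAR = -2147483647;

    // Hypothetical padding in ints; a useful value (if any) must be found by measurement.
    static final int PAD = 16;

    // Stands in for the pixel array owned by the BufferedImage.
    static final int[] pixelArray = new int[LENGTH];

    // Z-buffer with PAD unused elements at the front.
    static final int[] zbufferArray = new int[LENGTH + PAD];

    static void clearArray(int color) {
        for (int i = 0; i < LENGTH; i++) {
            pixelArray[i] = color;
            // Logical z-buffer element i lives at index i + PAD.
            zbufferArray[i + PAD] = Z_FAR;
        }
    }

    public static void main(String[] args) {
        long start = System.currentTimeMillis();
        for (int z = 0; z < 100; z++) {
            clearArray(0);
        }
        System.out.println(System.currentTimeMillis() - start);
    }
}
```

The cost of this approach is that every z-buffer access in the renderer would need the same offset.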

Without having any insight into that topic…

Might it be possible that some bounds-check elimination doesn’t work in the one-loop construct?

[quote]Might it be possible that some bounds-check elimination doesn’t work in the one-loop construct?
[/quote]
I go for the mem. alignment theory myself :slight_smile:
I thought of bounds check elimination too, but that doesn’t explain jview’s results, since I don’t think jview does any bounds check elimination. And I don’t think it could make such a large difference.

How would the length of the array affect the performance, in terms of overflowing the cache…?

I got ~450 and ~990 on a P4 1.6GHz. When I changed the
array size from 196608 to 496608, the times were nearly
identical - ~1100.
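
For the cache question, the raw arithmetic for the two array sizes can be sketched as below. The class ArrayFootprint is made up for the example, and the figures cover only the int payload; the per-array object header is VM-specific and left out. Whether a given size overflows the cache depends on the CPU model, which the sketch does not try to answer.

```java
class ArrayFootprint {
    // Payload size of an int[] of the given length: 4 bytes per int.
    static long payloadBytes(int length) {
        return 4L * length;
    }

    public static void main(String[] args) {
        int[] lengths = {196608, 496608};
        for (int i = 0; i < lengths.length; i++) {
            long bytes = payloadBytes(lengths[i]);
            System.out.println(lengths[i] + " ints = " + bytes + " bytes = "
                    + (bytes / 1024.0) + " KiB, remainder modulo 64 KiB = "
                    + (bytes % 65536));
        }
    }
}
```

The original length works out to 786432 bytes (768 KiB) per array, so about 1.5 MiB for both, and that size is an exact multiple of 64 KiB. The larger length gives 1986432 bytes per array, which is not a multiple of 64 KiB. That difference may be relevant to the alignment theory raised elsewhere in this thread, though that is a guess.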

Food for thought…

Unlikely, because the performance you are getting is quite good. However, changing the length of an array may change its alignment… who knows what the VM’s memory management is doing there. I’ve found a bug report for a 1.4 beta VM saying that doubles were not aligned correctly on the stack. Maybe this is a similar problem that hurts P4s more than Athlons or P3s. The P4 IS quite sensitive to incorrect alignment… have a look here (for example): http://gcc.gnu.org/ml/gcc-bugs/2001-07/msg01255.html

@EgonOlsen:

Oh yes, possibly I may have tripped into the alignment problem
when I changed the array size.

The 1.4 double alignment problem, I thought, went away with
the release of Hopper - though I’ve heard also reports to the
contrary. (Bruce) Walter had highlighted the problem well on his
website.

I’ve come across a report which talks about P4’s bad performance
with unaligned reals. Wouldn’t the JVM have “compiler” instructions
for optimizing the alignment, though?

On my Win98 PIII 400 MHz the results are the same for both loops…