Minecraft

just a thought about the rendering…

what is the maximum number of boxes that you draw at once ?

I got around 20/30 fps on my computer with an ATI X800, and I was just thinking that using some gpu display list combination may be able to render the whole things pretty fast without any other software side improvments…

the problem is that with java, making pre-oprimisation as octree/bsp or such does not always have the wanted effect as it can have in C/C++ as Java call to Opengl are very slow, I manage to draw arround 80000 faces (so around 7000 boxes) (all front of the screen at 200 fps on my computer with this video card (ATI X800) only by reducing opengl call, all try to increase it using software optimisation only make the fps decrease)

somthing really brut force like : 1 display list per box type & 1 display that call boxes display for the whole world, and finaly only one call to the main display list to draw the whole word and a recompile of the world display list for each change, this way you keep all your CPU free to perform network & physic stuff

all of that IMHO, that’s just a thought…

EDIT:

about the structure as there is usually less vertical entry than horizontal, I would have used a 2d array that lie the ground and some simple list linked or static for all cube vertical/above to that entry this way you get one display list per 2d array entry that represent all cube vertically and can perform extrem fast collision aswell as fast display rebuild (as you only rebuild the modified entry)

with a 2d array for all ground entry you may get :
1 display for cube
1 display for every entry
1 main display that simply call all entry

for drawing : one call to the main display that call every 2d entry display wich call cube display where needed
for modification : you simply recompile the display of the corresponding 2d entry without modifying any other

Let’s see whether that cache locality thingy works… is the overhead small enough?

locality is set to 8x8x8 grids, so 512 bytes. With 16x16x16 you’d get ‘perfect’ page-size (4K) chunks, but byte[] are not page-size aligned, so you’ll probably lose precious performance here.

Y is UP



int localStride = 8*8*8;
int localStrideMulAsShift = 3+3+3;

int xCount  = 1024;
int yCount  = 256;
int zCount  = 1024;

int xShift  = 10 - 3; // (1024 = 2^10) / (8 = 2^3)
int yShift  =  8 - 3; // (256  = 2^8)  / (8 = 2^3)
int zShift  = 10 - 3; // (1024 = 2^10) / (8 = 2^3)
int xzShift = xShift + zShift;

public byte get(byte[] grid, int x, int y, int z)
{
        return grid[this.toIndex(x,y,z)];
}

public void set(byte[] grid, int x, int y, int z, byte cell)
{
        grid[this.toIndex(x,y,z)] = cell;
}

private int toIndex(int x, int y, int z)
{
	int iChunk = (x >> 3);
        iChunk    |= (y >> 3) << xzShift;
        iChunk    |= (z >> 3) << xShift;

        int index =  iChunk << localStrideMulAsShift;
	    index |= ((z & 7) << 3) | ((y & 7) << 6) | (x & 7);
	return index;
}

This code is not tested, in any way, just typed in notepad, so it might as well not compile :wink:

When checking cells on the ‘sub grid edges’ you get less than optimal performance, but it should be much better than the current approach, which is more or less random access for the CPU.

Yet another silly idea… if you’re mainly scanning along 1 axis (that is algorithmically very likely) then you can use a 16x8x8 or 8x8x16, so that you get 1K size subgrids that are in the direction of your ‘inner for-loop’.

Algorithm choice is usually about the “common” case, not the “degenerate case” - pick the right one for what you need to do, and in this case, rendering a world made out of identically sized blocks, the octree is the optimal solution. You then need to tune the depth of the octree for best performance.

“Classification” of objects in the octree is the process of deciding which octant that a thing lives in. Some objects fit nicely entirely inside a node; some objects span two nodes, and are stored in the node above. The worst case is where an object spans every single node right up to the root, and the degenerate case with octrees is where you have a large number of objects which by a quirk of the geometry, all end up in the root node. However, your boxes are the perfect shape and in the perfect distribution for there to never be a need to span a single node with them. You should, in short, end up with the most perfectly optimal usage of the most perfectly optimal data structure for this task. So if it didn’t work out like that - something is probably wrong with the octree code or your usage of it!

Cas :slight_smile:

Stepping through your Octtree is slow, just like stepping through a LinkedList is slow, compared to an ArrayList.

Due to the lack of structs, your Nodes will be all over the heap, just like your arrays holding small parts of the world. Further, in an octtree, the JIT simply can’t remove all those null-checks, because those references often will be null. A conditional check can be as slow as 30 mathematical instructions, and not even counting flushing the CPU pipeline, and should be avoided like the plague - if you are really seeking ultimate performance. Octtrees are littlered with conditional checks, ‘random access’ pointers (very bad cache locality).

So even in ‘normal’ conditions (not as extreme as 8 blocks in the corners like Markus suggested) the overhead of the octtree can simply kill it, even if it is more efficient in theory, the CPU simply isn’t well adapted to it, compared to bruteforce, on neatly compact datastructures like massive arrays.

I know. I was just pointing out that claiming that octtrees are faster than brute force is not a strictly true statement. In my case, it’s darn near degenerate case since the level size isn’t big enough to let huge areas of the map get culled in a single pass, the level layout has to be able to change a lot in a short period of time, and the player is most frequently in the middle of the map.

Test results (that nobody is waiting for ;))

normal grid: 58906ms locality grid: 26688ms normal grid: 59078ms locality grid: 26906ms

The dataset is 1024x256x1024 cells (256MB) so the cache is swamped anyway. You can clearly see that with this locality you eliminate most of the cache misses, yet those remaining, have a disproportional impact, thus giving ‘only’ a factor ~2 performance boost.


public class LocalityGrid
{
   private static final int localStride           = 8 * 8 * 8;
   private static final int localStrideMulAsShift = 3 + 3 + 3;

   private static final int xShift                = 10 - 3;
   private static final int yShift                = 8 - 3;
   private static final int zShift                = 10 - 3;
   private static final int xzShift               = xShift + zShift;

   public static byte[] create()
   {
      return new byte[1 << (xShift + yShift + zShift + localStrideMulAsShift)];
   }

   public static int countNeighbours(byte[] grid, int x, int y, int z)
   {
      int count = 0;
      count += get(grid, x - 1, y - 1, z - 1);
      count += get(grid, x, y - 1, z - 1);
      count += get(grid, x + 1, y - 1, z - 1);
      count += get(grid, x - 1, y, z - 1);
      count += get(grid, x, y, z - 1);
      count += get(grid, x + 1, y, z - 1);
      count += get(grid, x - 1, y + 1, z - 1);
      count += get(grid, x, y + 1, z - 1);
      count += get(grid, x + 1, y + 1, z - 1);

      count += get(grid, x - 1, y - 1, z);
      count += get(grid, x, y - 1, z);
      count += get(grid, x + 1, y - 1, z);
      count += get(grid, x - 1, y, z);
      //count += get(grid, x, y, z);
      count += get(grid, x + 1, y, z);
      count += get(grid, x - 1, y + 1, z);
      count += get(grid, x, y + 1, z);
      count += get(grid, x + 1, y + 1, z);

      count += get(grid, x - 1, y - 1, z + 1);
      count += get(grid, x, y - 1, z - 1);
      count += get(grid, x + 1, y - 1, z + 1);
      count += get(grid, x - 1, y, z + 1);
      count += get(grid, x, y, z + 1);
      count += get(grid, x + 1, y, z + 1);
      count += get(grid, x - 1, y + 1, z + 1);
      count += get(grid, x, y + 1, z + 1);
      count += get(grid, x + 1, y + 1, z + 1);
      return count;
   }

   public static byte get(byte[] grid, int x, int y, int z)
   {
      return grid[toIndex(x, y, z)];
   }

   public static void set(byte[] grid, int x, int y, int z, byte cell)
   {
      grid[toIndex(x, y, z)] = cell;
   }

   private static int toIndex(int x, int y, int z)
   {
      int iChunk = (x >> 3);
      iChunk |= (y >> 3) << xzShift;
      iChunk |= (z >> 3) << xShift;

      int index = iChunk << localStrideMulAsShift;
      index |= ((z & 7) << 3) | ((y & 7) << 6) | (x & 7);
      return index;
   }
}


public class NormalGrid
{
   private static final int xShift  = 10;
   private static final int yShift  = 8;
   private static final int zShift  = 10;
   private static final int xzShift = xShift + zShift;

   public static byte[] create()
   {
      return new byte[1 << (xShift + yShift + zShift)];
   }

   public static int countNeighbours(byte[] grid, int x, int y, int z)
   {
      int count = 0;
      count += get(grid, x - 1, y - 1, z - 1);
      count += get(grid, x, y - 1, z - 1);
      count += get(grid, x + 1, y - 1, z - 1);
      count += get(grid, x - 1, y, z - 1);
      count += get(grid, x, y, z - 1);
      count += get(grid, x + 1, y, z - 1);
      count += get(grid, x - 1, y + 1, z - 1);
      count += get(grid, x, y + 1, z - 1);
      count += get(grid, x + 1, y + 1, z - 1);

      count += get(grid, x - 1, y - 1, z);
      count += get(grid, x, y - 1, z);
      count += get(grid, x + 1, y - 1, z);
      count += get(grid, x - 1, y, z);
      // count += get(grid, x, y, z);
      count += get(grid, x + 1, y, z);
      count += get(grid, x - 1, y + 1, z);
      count += get(grid, x, y + 1, z);
      count += get(grid, x + 1, y + 1, z);

      count += get(grid, x - 1, y - 1, z + 1);
      count += get(grid, x, y - 1, z - 1);
      count += get(grid, x + 1, y - 1, z + 1);
      count += get(grid, x - 1, y, z + 1);
      count += get(grid, x, y, z + 1);
      count += get(grid, x + 1, y, z + 1);
      count += get(grid, x - 1, y + 1, z + 1);
      count += get(grid, x, y + 1, z + 1);
      count += get(grid, x + 1, y + 1, z + 1);
      return count;
   }

   public static byte get(byte[] grid, int x, int y, int z)
   {
      return grid[toIndex(x, y, z)];
   }

   public static void set(byte[] grid, int x, int y, int z, byte cell)
   {
      grid[toIndex(x, y, z)] = cell;
   }

   private static int toIndex(int x, int y, int z)
   {
      int index = x;
      index |= y << xzShift;
      index |= z << xShift;
      return index;
   }

}


public class GridBenchmark
{
   public static void main(String[] args)
   {
      for (int i = 0; i < 4; i++)
      {
         System.out.println("normal grid: " + benchNormalGrid() + "ms");
         System.out.println("locality grid: " + benchLocalityGrid() + "ms");
      }
   }

   private static long benchNormalGrid()
   {
      byte[] grid = NormalGrid.create();

      long t0 = System.currentTimeMillis();
      for (int x = 1; x < 1023; x++)
      {
         for (int y = 1; y < 254; y++)
         {
            for (int z = 1; z < 1023; z++)
            {
               NormalGrid.countNeighbours(grid, x, y, z);
            }
         }
      }
      long t1 = System.currentTimeMillis();

      return t1 - t0;
   }

   private static long benchLocalityGrid()
   {
      byte[] grid = LocalityGrid.create();

      long t0 = System.currentTimeMillis();
      for (int x = 1; x < 1023; x++)
      {
         for (int y = 1; y < 254; y++)
         {
            for (int z = 1; z < 1023; z++)
            {
               LocalityGrid.countNeighbours(grid, x, y, z);
            }
         }
      }
      long t1 = System.currentTimeMillis();

      return t1 - t0;
   }
}

Update:
normal ==> 59s 2x 2x 2 ==> 42s 4x 4x 4 ==> 30s 8x 8x 8 ==> 26s 16x16x16 ==> 23s 32x16x32 ==> 23s 32x32x32 ==> 23s


Update:
normal ==> 59s 2x 2x 2 ==> 39s 4x 4x 4 ==> 26s 8x 8x 8 ==> 24s 16x16x16 ==> 20s 32x32x32 ==> 19s

Optimizing instructions for the JIT… (making any changes, like removing local variables will add 3s!!)


//this:
int a = (...) | (...) | (...);

// is SLOWER than
int a = 0;
a |= (...);
a |= (...);
a |= (...);


   private static final int bits        = 5; // 32x32x32
   private static final int bits_plus_1 = bits + 1;
   private static final int mask        = ~(-1 << bits); // 31
   private static final int strideShift = bits + bits + bits;

   private static final int xBits       = (10 - bits);
   private static final int zBits       = (10 - bits);
   private static final int yBits       = (8 - bits);

   private static final int xShift      = strideShift;
   private static final int zShift      = xBits + strideShift;
   private static final int yShift      = (xBits + zBits) + strideShift;

   public static byte[] create()
   {
      return new byte[1 << (xBits + yBits + zBits + strideShift)];
   }

// set & get

   private static int toIndex(int x, int y, int z)
   {
      int global = 0;
      global |= (x >> bits) << xShift;
      global |= (y >> bits) << yShift;
      global |= (z >> bits) << zShift;

      int local = 0;
      local |= (z & mask) << bits;
      local |= (y & mask) << bits_plus_1;
      local |= (z & mask);

      return global | local;
   }

59s / 19s = factor 3.1 8)

NB : dont know if you already do that but you may remove all boxes that have 6 (plain) neighbords cause they will never be visible, this can ends up to remove half of them.

It baffles me how a JVM engineer could manage to do this.

Cas :slight_smile:

I think it’s about pattern matching (the JIT is nothing more than a pattern matcher…)

int a = ((q>>r)<<s) | ((t>>u)<<v) | ((w>>x)<<y);
might be harder to recognize than:
int a = 0; a |= ((q>>r)<<s); a |= ((t>>u)<<v); a |= ((w>>x)<<y);

But then again… in bytecode they should look very similar…

Indeed, they look identical even in source code to me…

Cas :slight_smile:

Just wanted to add: this discussion is fascinating to read, keep it up guys! 8)

why dont you activate bilinear filtering on the boxes texture , it will look better no ?

Well, it’ll look blurrier, for sure… as for better… I think it seems to be in keeping with the boxes the world is made of. Like each block is made out of hundreds of pixel sized blocks!

Cas :slight_smile:

oups… sry… :persecutioncomplex: seems so logic…

the game needs an option to turn on vsync, my laptop fan starts blowing flames after a bit of play at maximum cpu usage :slight_smile:

Ah, yeah. Blame sun.
Any form of sleeping at all makes the system clock run too fast on some computers, and they’re fine with it.

ah that’s a shame, hopefully they’ll fix it someday.

BTW does it happen with LWJGL’s vsync option too ? Display.setVSyncEnabled(true);
I was under the impression that it was controlled by the gfx driver and so separate from any java timer/sleep, maybe I’m wrong though.

The vsync in LWJGL typically only works fullscreen on Windows. Not sure about other OSes.

I ignore the tiny % of computers with timing problems and use sleep() in Windowed mode.

Cas :slight_smile:

I have trouble to understand how the sleep method can cause CPU intensive usage, could you give a little more information on this issue ? in addition to sleep a Thread.yield could help on that ?

A busy loop around sleep(0) on modern windows based systems, as I recall, calls the HALT instruction. At least it does in Intel systems. Not sure about AMD based ones. In fact the complete details are lost in the mists of time and I might be talking rubbish. But whatever, on some systems this instruction seems to make the clock behave in curious ways, like running twice as fast. It’s not very common though.

Cas :slight_smile: