Optimizing Conway's Game of Life

sci4me · July 24, 2015, 3:57am

Hey ya’ll. So I have implemented Conway’s Game of Life. I have since optimized it as much as I can… but I still want faster. I am well aware of hashlife, but I don’t understand it. Until I can figure hashlife out, I would like to try to improve my current implementation as much as possible.

Here’s my code:


public static class Grid {
  private static final boolean[] LOOKUP_TABLE = new boolean[32];
  
  private static final int STATE_BIT = 4;
  
  static {
    for(byte i = 0; i < Grid.LOOKUP_TABLE.length; i++) {
      final boolean live = i < 16;
      final byte index = setBit(i, STATE_BIT, live);
      final int neighbors = i >= 16 ? i - 16 : i;
      Grid.LOOKUP_TABLE[index] = live && neighbors == 2 || neighbors == 3;
    }
  }
  
  private final int width;
  private final int height;
  private byte[] grid;
  
  public Grid(final int width, final int height) {
    this.width = width;
    this.height = height;
    this.grid = new byte[this.width * this.width];
  }
  
  public int getWidth() {
    return this.width;
  }
  
  public int getHeight() {
    return this.height;
  }
  
  public boolean get(final int x, final int y) {
    return getBit(this.grid[this.encode(x, y)], STATE_BIT);
  }
  
  public void set(final int x, final int y, final boolean state) {
    this.set(this.grid, x, y, state);
  }
  
  public void cycle() {
    byte[] newGrid = new byte[this.width * this.height];
    for(int i = 0; i < this.width; i++) {
      for(int j = 0; j < this.height; j++) {
        final int idx = this.encode(i, j);
        final int value = this.grid[idx];
        if(value != 0)
          this.set(newGrid, i, j, LOOKUP_TABLE[value]);
      }
    }
    this.grid = newGrid;
  }
  
  public void clear() {
    for(int x = 0; x < this.grid.length; x++) {
      this.grid[x] = 0;
    }
  }
  
  private void set(final byte[] grid, final int x, final int y, final boolean state) {
    final int idx = this.encode(x, y);
    final boolean oldState = getBit(grid[idx], STATE_BIT);
    
    if(oldState == state) return;
    
    grid[idx] = setBit(grid[idx], STATE_BIT, state);
    
    for(int i = x - 1; i <= x + 1; i++) {
      for(int j = y - 1; j <= y + 1; j++) {
        if(i == x && j == y) continue;  
        
        int px = i;
        int py = j;

        // wrapping around - I prefer wrapping to infinite world/edged world
        if(i < 0)
          px += this.width;
        if(j < 0)
          py += this.height;
        if(i >= this.width)
          px -= this.width;
        if(j >= this.height)
          py -= this.height; 
        
        final int k = this.encode(px, py);
        final byte pixel = grid[k];
        final byte neighbors = setBit(pixel, STATE_BIT, false);        
        final boolean currentState = getBit(pixel, STATE_BIT);
        
        grid[k] = setBit((byte) (neighbors + (state ? 1 : -1)), STATE_BIT, currentState);
      }
    }
  }
  
  private int encode(final int x, final int y) {
    return y * this.width + x;
  }
  
  private static byte setBit(final byte value, final int bit, final boolean state) {
    if(state)
      return (byte) (value | (1 << bit));
    else
      return (byte) (value & ~(1 << bit));
  }
  
  private static boolean getBit(final byte value, final int bit) {
    return (value & (1 << bit)) != 0;
  }
}

I suppose I could manually inline functions like encode, getBit, and setBit, but I suspect the JVM does that anyway (if not the Java compiler). So, how else could I improve this? Can I make it more cache friendly perhaps?

Let me know what you think of my current implementation and how I can improve it. Thanks!

BurntPizza · July 24, 2015, 4:02am

Non-algorithmic improvements:

Interchange the i, j loops so that the y axis is the outer loop.
If you can, don’t allocate a new large array each cycle, but reuse an old one. Even though alloc/free is supposed to be fast for short-lived objects, I often get sizable gains from pooling large arrays.
clear() could simply be Arrays.fill(grid, 0);

sci4me · July 24, 2015, 4:12am

Can’t believe I missed the y-axis thing.

So, to reuse an old array, should I simply have two arrays in my grid class and a boolean telling which is in use, or should I use something more like an actual object pool?

BurntPizza · July 24, 2015, 4:12am

The bool switch will work.

BurntPizza · July 24, 2015, 4:20am

Also, I might be derping, but is there any reason you’re using the 3rd bit of everything instead of just using 0 = false, any-other-byte (such as 1) = true?
The jit will probably figure it out, but otherwise that’s potentially a lot of unnecessary shifting and masking going on, not to mention code obfuscation.

EDIT: yep, late-night derping.

sci4me · July 24, 2015, 4:24am

Two things: first, I’m not sure why this isn’t working, I’m probably being retarded… can ya look at it:


public static class Grid
    {
        private static final boolean[] LOOKUP_TABLE = new boolean[32];

        private static final int STATE_BIT = 4;

        static
        {
            for (byte i = 0; i < Grid.LOOKUP_TABLE.length; i++)
            {
                final boolean live = i < 16;
                final byte index = setBit(i, STATE_BIT, live);
                final int neighbors = i >= 16 ? i - 16 : i;
                Grid.LOOKUP_TABLE[index] = live && neighbors == 2 || neighbors == 3;
            }
        }

        private final int width;
        private final int height;
        private byte[] gridA;
        private byte[] gridB;
        private boolean useGridB;

        public Grid(final int width, final int height)
        {
            this.width = width;
            this.height = height;
            this.gridA = new byte[this.width * this.height];
            this.gridB = new byte[this.width * this.height];
        }

        public int getWidth()
        {
            return this.width;
        }

        public int getHeight()
        {
            return this.height;
        }

        public boolean get(final int x, final int y)
        {
            if (this.useGridB)
                return getBit(this.gridB[this.encode(x, y)], STATE_BIT);
            return getBit(this.gridA[this.encode(x, y)], STATE_BIT);
        }

        public void set(final int x, final int y, final boolean state)
        {
            if (this.useGridB)
                this.set(this.gridB, x, y, state);
            this.set(this.gridA, x, y, state);
        }

        public void cycle()
        {
            for (int j = 0; j < this.height; j++)
            {
                for (int i = 0; i < this.width; i++)
                {
                    final int idx = this.encode(i, j);
                    final int value = this.useGridB ? this.gridB[idx] : this.gridA[idx];
                    if (value != 0)
                        this.set(this.useGridB ? this.gridA : this.gridB, i, j, LOOKUP_TABLE[value]);
                }
            }
            this.useGridB = !this.useGridB;
        }

        public void clear()
        {
            Arrays.fill(this.gridA, (byte) 0);
            Arrays.fill(this.gridB, (byte) 0);
            this.useGridB = false;
        }

        private void set(final byte[] grid, final int x, final int y, final boolean state)
        {
            final int idx = this.encode(x, y);
            final boolean oldState = getBit(grid[idx], STATE_BIT);

            if (oldState == state) return;

            grid[idx] = setBit(grid[idx], STATE_BIT, state);

            for (int i = x - 1; i <= x + 1; i++)
            {
                for (int j = y - 1; j <= y + 1; j++)
                {
                    if (i == x && j == y) continue;

                    int px = i;
                    int py = j;

                    if (i < 0)
                        px += this.width;
                    if (j < 0)
                        py += this.height;
                    if (i >= this.width)
                        px -= this.width;
                    if (j >= this.height)
                        py -= this.height;

                    final int k = this.encode(px, py);
                    final byte pixel = grid[k];
                    final byte neighbors = setBit(pixel, STATE_BIT, false);
                    final boolean currentState = getBit(pixel, STATE_BIT);

                    grid[k] = setBit((byte) (neighbors + (state ? 1 : -1)), STATE_BIT, currentState);
                }
            }
        }

        private int encode(final int x, final int y)
        {
            return y * this.width + x;
        }

        private static byte setBit(final byte value, final int bit, final boolean state)
        {
            if (state)
                return (byte) (value | (1 << bit));
            else
                return (byte) (value & ~(1 << bit));
        }

        private static boolean getBit(final byte value, final int bit)
        {
            return (value & (1 << bit)) != 0;
        }
    }

Second, to answer your question, what? Uh… The first 4 bits are storing how many neighbors a cell has, the next one is the cell’s state. Then I can just zip through them with the LUT…

Not sure if I properly understood your question tbh

Jesse · July 24, 2015, 4:32am

Profiling is probably the way to go here (optimizing via forum posts is a bit speculative). But, it seems worth mentioning that your new cycle() function has introduced multiple conditionals in the inner loop that weren’t there before, which it seems likely could offset any gains from pooling the grids. Since the results of the conditionals are the same each time though, you should just be able to grab references to the grids at the beginning of the function, at which point your inner loop can go back to the way it was previously.

sci4me · July 24, 2015, 4:33am

Huh. The line:


if(value != 0)

needed to be commented out to fix it.

This may be a surprise, but performance is actually a fair bit worse using two buffers and a boolean switch. Likely due to all the extra branching. I may try another method…

EDIT: ninjad. Yes, you’re right about the conditionals and whatnot. Changing that doesn’t really improve performance.

I think this method just adds too much branching… but maybe a pool world help? Who knows.

As for profiling: not a LOT can be done… I know that the set(byte[], int, int, boolean) method takes the most CPU time… but… eh… how helpful is that?

BurntPizza · July 24, 2015, 4:36am

Lifting the conditionals out of the loop is almost certainly done by the jit, but I agree it would make the code clearer.
See: https://en.wikipedia.org/wiki/Loop-invariant_code_motion

sci4me · July 24, 2015, 4:37am

Double-ninjad. Gosh, ya’ll are too fast for me!

I am going to try a slightly different way to do this… with less branching but still using two arrays.

EDIT: it seems that storing two arrays in the Grid class just isn’t really a good way to do this…

Are we sure we can’t optimize the set(byte[], int, int, boolean) method? If not, I guess this probably can’t really be optimized much.

Jesse · July 24, 2015, 4:45am

Whether you do it or the compiler does it (as BurntPizza mentioned), there should only be a couple branches needed at most, so I don’t think the branches will be a problem. There are other ways to swap the grids though, such as by swapping a pair of indices (untested):

i = (i + 1) % 2;
j = (i + 1) % 2;

As for pooling, I think you only need a pool of two grids, which is what you have now (maybe that’s what you meant). In any case, other than having the conditionals in the inner loop, it seems like you have the right idea with respect to the grid-swapping.

sci4me · July 24, 2015, 4:48am

Jesse:

sci4me:

I think this method just adds too much branching… but maybe a pool world help? Who knows.

Whether you do it or the compiler does it (as BurntPizza mentioned), there should only be a couple branches needed at most, so I don’t think the branches will be a problem. There are other ways to swap the grids though, such as by swapping a pair of indices (untested):
i = (i + 1) % 2;
j = (i + 1) % 2;
As for pooling, I think you only need a pool of two grids, which is what you have now (maybe that’s what you meant). In any case, other than having the conditionals in the inner loop, it seems like you have the right idea with respect to the grid-swapping.

I’m not sure what you mean by that exactly…

I tried having a pointer to each grid and swapping them at the end of cycle(). Er… what do you mean with those indicies?

BurntPizza · July 24, 2015, 4:50am

Yeah, the 2-pool only adds 1-2 conditionals per cycle() after all is said and done.
The (i + 1) % 2 is equivalent to and will probably reduce to the same thing as the flag = !flag.

EDIT: you could still remove the branch by having the two grids in an array themselves: (although I highly doubt saving 1 branch will mean anything…)


byte[][] grids = { gridA, gridB };

...

grid = grids[counter++ & 1];

Swapping global pointers may adversely affect alias analysis, so I would avoid doing that.

The loops in set() could still be interchanged as well.

Jesse · July 24, 2015, 4:51am

It doesn’t matter I was just pointing out that there are ways to do it that don’t involve conditionals. Anyway, swapping references at the end of the cycle as you mentioned in your last post is probably how I’d do it.

[Edit: Looks like BurntPizza offered a reason why swapping the references might not be the best choice, so I’ll take his word on that and withdraw that recommendation.]

sci4me · July 24, 2015, 4:54am

Okay, uh…

I guess I’m kind of confused as to why using two buffers seems to slow the program down… a lot…

Any ideas on this?

BurntPizza · July 24, 2015, 5:00am

If you remove the if(value != 0) from the original code, does it slow down by roughly the same amount?

sci4me · July 24, 2015, 5:04am

To my GREAT surprise, NO.

To my even greater surprise, it doesn’t seem to make THAT much of a difference…
the only difference removing it from the original code makes is that it can’t get to quite the same speed as it could when most of the cells are dead.

BurntPizza · July 24, 2015, 5:10am

Forgot about this:
It’s not semantically equivalent until the grid is getting zeroed somewhere in the process.


public void cycle() {
    byte[] grid = useGridB ? this.gridB : this.gridA;
    byte[] newGrid = useGridB ? this.gridA : this.gridB;
    this.useGridB = !this.useGridB;
+    Arrays.fill(newGrid, (byte) 0); // to make up for not having a freshly allocated (and thus zeroed) array

    for (int j = 0; j < this.height; j++) {
        for (int i = 0; i < this.width; i++) {
            final int idx = this.encode(i, j);
            final int value = grid[idx];
            if (value != 0) // put this back in place, although it sounds like it isn't very beneficial
                this.set(newGrid, i, j, LOOKUP_TABLE[value]);
        }
    }
}

Alternatively newGrid[idx] = 0; in each loop iteration, access patterns between the two arrays might outweigh the vectorizability of fill().

sci4me · July 24, 2015, 5:17am

BurntPizza:

Forgot about this:
It’s not semantically equivalent until the grid is getting zeroed somewhere in the process.


public void cycle() {
    byte[] grid = useGridB ? this.gridB : this.gridA;
    byte[] newGrid = useGridB ? this.gridA : this.gridB;
    this.useGridB = !this.useGridB;
+    Arrays.fill(newGrid, (byte) 0); // to make up for not having a freshly allocated (and thus zeroed) array

    for (int j = 0; j < this.height; j++) {
        for (int i = 0; i < this.width; i++) {
            final int idx = this.encode(i, j);
            final int value = grid[idx];
            if (value != 0) // put this back in place, although it sounds like it isn't very beneficial
                this.set(newGrid, i, j, LOOKUP_TABLE[value]);
        }
    }
}

This still doesn’t even come close to beating my original code!

What this tells me: it must be faster to allocate a new byte[] than to do all the work we have to do in order to switch between the two that we save.

BurntPizza · July 24, 2015, 5:25am

But that’s the weird part: there’s almost no work involved to keep a 2-pool.
Frequent alloc/gc for large chunks of memory (+ zeroing unless your OS keeps pre-zeroed pages, although those will run out) should be much more work.

I can only guess that some optimization analysis (like alias analysis) doesn’t like the buffer swap.
Does the cpu profile change much?