A general approach for NIO buffer pooling

Hi

Riven’s advice helped me a bit with implementing hierarchical streaming, but I wonder whether there is a general approach for NIO buffer pooling. My main concerns are fragmentation and thread safety. I might use only a single pool per thread, but how can I handle the fragmentation problem smartly?

Is sun.nio.ch.Util useful to solve this problem?

IIRC I never published code that does ByteBuffer pooling. I think you are referring to code of mine that slices a large ByteBuffer into smaller ones (to avoid the ~4K overhead per malloc). Once the large ByteBuffer is completely consumed (as in: you can’t slice off the demanded N bytes), it simply allocates another large ByteBuffer and starts slicing that one. I let the GC figure out when none of the sliced buffers are referenced any longer, at which point the large ByteBuffer is automatically deallocated.
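
A minimal standalone sketch of that slicing approach (class and method names are made up for illustration, reconstructed from the description above; the create() method of the pool further down uses the same trick):

import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class SliceAllocator {

   private static final int LARGE_SIZE = 1024 * 1024;
   private ByteBuffer large = newDirect(LARGE_SIZE);

   // slice 'bytes' off the current large buffer, allocating a fresh one when it runs out
   public ByteBuffer slice(int bytes) {
      if (bytes > LARGE_SIZE) {
         return newDirect(bytes); // too big to slice off, allocate it directly
      }
      if (bytes > large.remaining()) {
         // the old large buffer is reclaimed by the GC once all its slices are unreferenced
         large = newDirect(LARGE_SIZE);
      }
      large.limit(large.position() + bytes);
      ByteBuffer bb = large.slice();
      large.position(large.limit()).limit(large.capacity());
      return bb;
   }

   private static ByteBuffer newDirect(int bytes) {
      return ByteBuffer.allocateDirect(bytes).order(ByteOrder.nativeOrder());
   }
}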

If you want true pooling however, I advise you to make an array of Lists, each holding power-of-two sized ByteBuffers:


import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.ArrayList;
import java.util.List;

public class ByteBufferPool {

   // one free-list per power-of-two size (index = log2 of the capacity)
   final List<ByteBuffer>[] potBuffers;

   @SuppressWarnings("unchecked")
   public ByteBufferPool()
   {
      potBuffers = (List<ByteBuffer>[]) new List[32];
      for (int i = 0; i < potBuffers.length; i++) {
         potBuffers[i] = new ArrayList<ByteBuffer>();
      }
   }

   public ByteBuffer acquire(int bytes) {
      int alloc = allocSize(bytes);
      int index = Integer.numberOfTrailingZeros(alloc);
      List<ByteBuffer> list = potBuffers[index];

      // reuse a pooled buffer of the right capacity, or create a new one
      ByteBuffer bb = list.isEmpty() ? create(alloc) : list.remove(list.size() - 1);
      bb.position(0).limit(bytes);

      // fill with zeroes to ensure deterministic behavior upon handling 'uninitialized' data
      for (int i = 0, n = bb.remaining(); i < n; i++) {
         bb.put(i, (byte) 0);
      }

      return bb;
   }

   public void release(ByteBuffer buffer) {
      int alloc = allocSize(buffer.capacity());
      if (buffer.capacity() != alloc) {
         throw new IllegalArgumentException("buffer capacity not a power of two");
      }
      int index = Integer.numberOfTrailingZeros(alloc);
      potBuffers[index].add(buffer);
   }

   // drop all pooled buffers, letting the GC reclaim the backing memory
   public void flush() {
      for (int i = 0; i < potBuffers.length; i++) {
         potBuffers[i].clear();
      }
   }

   private static final int LARGE_SIZE = 1024 * 1024;
   private ByteBuffer largeBuffer = malloc(LARGE_SIZE);

   // slice small buffers off one large allocation to avoid the per-malloc overhead
   private ByteBuffer create(int bytes) {
      if (bytes > LARGE_SIZE)
         return malloc(bytes);

      if (bytes > largeBuffer.remaining()) {
         largeBuffer = malloc(LARGE_SIZE);
      }

      largeBuffer.limit(largeBuffer.position() + bytes);
      ByteBuffer bb = largeBuffer.slice();
      // advance past the slice and restore the limit, so remaining() reports what is actually left
      largeBuffer.position(largeBuffer.limit());
      largeBuffer.limit(largeBuffer.capacity());
      return bb;
   }

   private static ByteBuffer malloc(int bytes) {
      return ByteBuffer.allocateDirect(bytes).order(ByteOrder.nativeOrder());
   }

   // round up to the nearest power of two that is >= bytes
   private static int allocSize(int bytes) {
      if (bytes <= 0) {
         throw new IllegalArgumentException("attempted to allocate a non-positive number of bytes");
      }
      return (bytes > 1) ? Integer.highestOneBit(bytes - 1) << 1 : 1;
   }
}


Code is completely untested, not even compiled.

Using power-of-two (POT) buffer sizes solves a lot of problems. In case you worry about fragmentation, simply flush() the pool every once in a while, so that the GC can clean up after you.
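
A rough usage sketch (the loop, sizes and flush interval are just placeholders):

   ByteBufferPool pool = new ByteBufferPool();

   for (int frame = 0; frame < 10000; frame++) {
      ByteBuffer vertices = pool.acquire(12 * 1024); // backing capacity is rounded up to 16 KB
      try {
         // ... fill and use the buffer ...
      } finally {
         pool.release(vertices); // hand it back for reuse
      }

      // every once in a while, drop everything so the GC can defragment for you
      if (frame % 1000 == 999) {
         pool.flush();
      }
   }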

Thread safety is easily obtained by either using ThreadLocals or by extending the class in the example code and making all public methods synchronized. Both hurt performance to some degree, so you might want to read up on lock-free data structures. Keep in mind though that it’s unlikely to be your bottleneck, unless you acquire/release in some inner loop.


   // convenience factory: wraps a pool so that all public methods synchronize on the given mutex
   public static ByteBufferPool synced(final Object mutex) {
      if (mutex == null) {
         throw new NullPointerException();
      }

      return new ByteBufferPool() {
         @Override
         public ByteBuffer acquire(int bytes) {
            synchronized (mutex) {
               return super.acquire(bytes);
            }
         }
         
         @Override
         public void release(ByteBuffer buffer) {
            synchronized (mutex) {
               super.release(buffer);
            }
         }
         
         @Override
         public void flush() {
            synchronized (mutex) {
               super.flush();
            }
         }
      };
   }
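
The ThreadLocal route mentioned above could look roughly like this (again untested); each thread gets its own independent pool, so no synchronization is needed at all:

   // one independent pool per thread
   private static final ThreadLocal<ByteBufferPool> LOCAL_POOL =
      new ThreadLocal<ByteBufferPool>() {
         @Override
         protected ByteBufferPool initialValue() {
            return new ByteBufferPool();
         }
      };

   public static ByteBufferPool threadLocalPool() {
      return LOCAL_POOL.get();
   }

Note that a buffer acquired from one thread’s pool should also be released on that same thread, otherwise the per-thread pools end up holding buffers they never handed out.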

Hi

Thank you Riven once more.

Maybe I’m silly, but I wonder whether NIO buffer pooling is the right answer to my problems. I have recently thought about a smarter approach. When a player finishes the first level and moves on to the second one, some models will still be used and some won’t. Of course, I use direct NIO buffers to store their data (vertices, texture coordinates). My suggestion is the following:

  • just before going to another level, compare the factories used by the previous level and the factories used by the next level
  • ask the factories used in the previous level but not in the next level to explicitly release all native resources (for example by using the cleaners of their direct NIO buffers to destroy them)
  • ask the factories used in the next level but not in the previous level to load their models
  • don’t ask the factories used in both the previous and the next level to release anything, so their models stay loaded

I have seen many C++ programmers use some kind of resource manager as a sort of catch-all; I would rather not do the same thing. Is my latest suggestion completely stupid?
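
A rough sketch of that comparison, assuming a hypothetical ModelFactory type with load() and releaseNativeResources() methods (the names are made up for illustration):

import java.util.HashSet;
import java.util.Set;

public final class LevelTransition {

   interface ModelFactory {
      void load();
      void releaseNativeResources();
   }

   // release what only the old level needs, load what only the new level needs,
   // and leave the factories shared by both levels alone
   public static void transition(Set<ModelFactory> previousLevel, Set<ModelFactory> nextLevel) {
      Set<ModelFactory> toRelease = new HashSet<ModelFactory>(previousLevel);
      toRelease.removeAll(nextLevel);

      Set<ModelFactory> toLoad = new HashSet<ModelFactory>(nextLevel);
      toLoad.removeAll(previousLevel);

      for (ModelFactory factory : toRelease) {
         factory.releaseNativeResources(); // e.g. destroy the direct NIO buffers
      }
      for (ModelFactory factory : toLoad) {
         factory.load();
      }
   }
}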

Buffer pooling is only needed when you have a high allocation rate of small buffers.

A ‘per level’ allocation certainly doesn’t fit this picture. Further, a specialized approach will always beat a generic one: if you think you can do better, you probably can. But do you actually need any buffer pooling at all? Did it show up as a bottleneck? If so, is the generic solution fast enough anyway? If not, are you willing to actually fix that bottleneck, or would you rather work on logic/content that the player actually cares about?

TL;DR:
premature optimization is the premature root of premature evil.

It’s not premature optimization: JFPSM already uses 1.5 GB. I fixed the OutOfMemoryError some years ago by using indirect (heap) NIO buffers wherever direct NIO buffers were not absolutely necessary.

The alpha version of TUER only uses tens of MB, while the pre-beta version already uses more than 100 MB even though it has fewer features :o

If I destroy an NIO buffer, can it cause trouble with OpenGL if the buffer was used for a VBO (see glBufferData)?

Yes, it can crash the entire process. This is why libraries like JOGL and LWJGL keep a reference to the buffer you passed to those methods. Before that was done, the GC could free that memory, resulting in an access violation in the driver.

Ok. Sven explained to me that he had to modify something: glDeleteBuffers will now remove the kept references from the cache, so I can do what I planned. Just calling clear() on an object containing a direct NIO buffer does not guarantee that Java will immediately release the resources used on the native heap. My approach will probably only work with signed applications.
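
To illustrate the reference-keeping mentioned above (this is just a sketch of the idea, not JOGL’s or LWJGL’s actual code): the binding keeps the buffer passed to glBufferData reachable until glDeleteBuffers is called for that VBO.

import java.nio.ByteBuffer;
import java.util.HashMap;
import java.util.Map;

// sketch: keep each buffer reachable for as long as the driver may still read from it
public final class VboBufferCache {

   private final Map<Integer, ByteBuffer> keptBuffers = new HashMap<Integer, ByteBuffer>();

   // called alongside glBufferData(target, data, usage)
   public void onBufferData(int vboId, ByteBuffer data) {
      keptBuffers.put(vboId, data); // prevents the GC from freeing the native memory
   }

   // called alongside glDeleteBuffers(vboId)
   public void onDeleteBuffers(int vboId) {
      keptBuffers.remove(vboId); // the buffer may now be collected (and its native memory freed)
   }
}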

Thank you for this additional information. It would be strange if Java allowed releasing native memory without permission.

Some programmers on the Netty project tried to allocate and deallocate buffers “manually” with sun.misc.Unsafe, but it seems to be slower than using direct NIO buffers.
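
For reference, manual allocation with sun.misc.Unsafe looks roughly like this (the Unsafe instance has to be obtained through reflection, since its constructor is private); whether it is faster or slower than allocateDirect depends on the JVM:

import java.lang.reflect.Field;
import sun.misc.Unsafe;

public final class UnsafeAlloc {

   private static final Unsafe UNSAFE;

   static {
      try {
         // 'theUnsafe' is a private singleton field, so grab it via reflection
         Field field = Unsafe.class.getDeclaredField("theUnsafe");
         field.setAccessible(true);
         UNSAFE = (Unsafe) field.get(null);
      } catch (Exception e) {
         throw new ExceptionInInitializerError(e);
      }
   }

   public static long malloc(long bytes) {
      return UNSAFE.allocateMemory(bytes); // returns a raw native address
   }

   public static void free(long address) {
      UNSAFE.freeMemory(address); // must be called explicitly, the GC will not do it
   }
}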