How to read a single file efficiently using NIO and multiple threads?

Hi

I would like to read big files with several threads without blocking, so that each thread can work on a region of the file without locking the whole file. I have been looking for a solution for days… A FileChannel has no configureBlocking method, selectors can only be used with socket channels… FileChannel.map() seems to block the whole file too. How can I map a region of a file in memory, and create such mappings from several threads?

I would think it makes the most sense to read the whole file into memory using a single thread (since the disk will be the bottleneck here). Once it's all in memory, slice it into N partitions where N = number of CPU threads.

I don’t think there would be any advantage in having N threads doing random access to a big file, since this would probably cause needless seeking on the disk.
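For instance, a rough sketch of that approach (the file name and the process() call are placeholders, exception handling is omitted, and it obviously only works while the whole file fits in a single byte array):

import java.io.DataInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.nio.ByteBuffer;

File file = new File("big.dat");
byte[] data = new byte[(int) file.length()];      // whole file must fit in one array

DataInputStream in = new DataInputStream(new FileInputStream(file));
in.readFully(data);                               // one thread does all the disk I/O
in.close();

int n = Runtime.getRuntime().availableProcessors();
int partSize = (data.length + n - 1) / n;

for (int i = 0; i < n; i++) {
   final int from = i * partSize;
   final int to = Math.min(from + partSize, data.length);
   if (from >= to) break;
   // each worker gets a read-only view of its own region
   final ByteBuffer slice = ByteBuffer.wrap(data, from, to - from).asReadOnlyBuffer();
   new Thread(new Runnable() {
      public void run() {
         process(slice);                           // process() = whatever you do with a region
      }
   }).start();
}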

Alternatively, if the file is really big and you don't want to read it entirely into memory, FileChannels are thread-safe. According to the JavaDoc, if you use the read methods that take an absolute position, the operations will be performed concurrently (provided that the underlying implementation supports this). Thus, if you want to go the concurrent route, perhaps you could just have each thread read its region using explicit positions. Performance-wise this is probably going to be worse than simply reading each region sequentially and passing it off to another thread for processing. As Matzon says, concurrent reading is probably going to generate a lot of seeking and slow the whole process down.
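If you did want to try it anyway, a quick sketch of the positional-read approach could look like this (the number of reader threads, the buffer size and the file name are arbitrary, and checked exceptions outside the worker are glossed over):

import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;

final FileChannel channel = new RandomAccessFile("big.dat", "r").getChannel();
final long fileSize = channel.size();
final int readers = 4;
final long regionSize = (fileSize + readers - 1) / readers;

for (int i = 0; i < readers; i++) {
   final long start = i * regionSize;
   final long end = Math.min(start + regionSize, fileSize);
   new Thread(new Runnable() {
      public void run() {
         try {
            ByteBuffer buf = ByteBuffer.allocate(64 * 1024);
            long pos = start;
            while (pos < end) {
               buf.clear();
               buf.limit((int) Math.min(buf.capacity(), end - pos)); // don't read past this region
               int n = channel.read(buf, pos);     // absolute read, no shared file pointer
               if (n < 0) break;
               // ... process the bytes in buf here ...
               pos += n;
            }
         } catch (IOException e) {
            e.printStackTrace();
         }
      }
   }).start();
}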

Try looking into memory mapped files.

No.

FileChannel.map(FileChannel.MapMode mode, long position, long size)

As lhkbob said, it is exactly what you need. The OS will manage loading and storing for you.

But the files are too big to be stored in memory :frowning: I tried to use a MappedByteBuffer and of course it did not work. Maybe mapping smaller regions of the files could work.

Windows doesn't seem to handle this the way I expected: the second thread waits for the first thread to finish reading :frowning:

If you map truly huge files on a 32-bit OS, yes, it won't work, as you run out of virtual memory. On a 64-bit OS you can map any file into memory, even terabytes big, if your filesystem supports it.

I looked at the source code of map:
http://www.docjar.com/html/api/sun/nio/ch/FileChannelImpl.java.html

Unlike the read() method, there is no "synchronized" on the same lock. But I need a huge amount of virtual memory to handle files of several GB, don't I?

Well, it's not like you'll start to swap if you map a 500GB file. If you have a 64-bit OS, you can map that within a millisecond or so.

And why would you need synchronisation if each thread has its own mapped byte buffer?
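Something along those lines, where each thread maps and works on its own slice, could look like this (the file name, region size and process() call are made up, exception handling is omitted, and for a really huge file you'd want to cap the number of threads rather than start one per region):

import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.channels.FileChannel.MapMode;

FileChannel channel = new RandomAccessFile("huge.dat", "r").getChannel();
long fileSize = channel.size();
long regionSize = 256L * 1024 * 1024;             // 256MB per mapping, well under the 2GB limit

long offset = 0;
while (offset < fileSize) {
   final long size = Math.min(regionSize, fileSize - offset);
   // every thread gets its own MappedByteBuffer, so there is nothing to synchronise on
   final MappedByteBuffer region = channel.map(MapMode.READ_ONLY, offset, size);
   new Thread(new Runnable() {
      public void run() {
         process(region);                          // process() = your per-region work
      }
   }).start();
   offset += size;
}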

I also have to ask what you hope to get out of multi-threading disk access. Pretty much any disk is going to give you serialized input (unless I missed something…). It might fake it and provide bits and pieces of multiple files interleaved (like how old CPUs let you run more than one program). Why not re-work the problem so that the tasks can operate on chunks of the data in a parallel fashion? You have a single thread reading through the file, and when a unit of work is ready to process it sends it off to a worker thread.

Actually I have to read pieces of data from a BIG file, convert them and write the converted data into another file. I wanted to use one thread per region of the file. I thought it would be possible to read a single file using several threads.

I don’t need synchronization, I don’t want it; when I found the “synchronized” keyword in the read method, I was disappointed.

As others pointed out, it would probably slow down the whole process a lot. Just do it sequentially.

Trying to multithread the file IO (especially when you have one big file to chew through and not multiple files) is fundamentally wrong-headed IMHO.

The most efficient solution is probably to have one file input thread (producer thread) reading in chunks of data (say, a few MB big) and pushing them onto a queue. Then have a pool of consumer/worker threads taking file chunks off the queue and processing them in parallel, before handing the output chunks to another file output thread via another queue. Your thread pool would have numCores-2 threads (so on an 8 core machine you’d have one input thread, one output thread and six worker threads).

Check out the java.util.concurrent stuff, it makes this kind of setup easy. :smiley:

Also, don’t bother with nio, just use random access files.

Cas :slight_smile:
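A bare-bones sketch of that producer/consumer setup, using plain RandomAccessFile as suggested (the file names, chunk size, queue capacities and the convert() call are all made up, and clean shutdown plus output ordering are left out):

import java.io.RandomAccessFile;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

final int CHUNK = 4 * 1024 * 1024;                 // 4MB chunks
final BlockingQueue<byte[]> inQueue = new ArrayBlockingQueue<byte[]>(16);
final BlockingQueue<byte[]> outQueue = new ArrayBlockingQueue<byte[]>(16);

// producer: the only thread that touches the input file
new Thread(new Runnable() {
   public void run() {
      try {
         RandomAccessFile in = new RandomAccessFile("input.dat", "r");
         long remaining = in.length();
         while (remaining > 0) {
            byte[] chunk = new byte[(int) Math.min(CHUNK, remaining)];
            in.readFully(chunk);
            remaining -= chunk.length;
            inQueue.put(chunk);                    // blocks if the workers fall behind
         }
         in.close();
      } catch (Exception e) { e.printStackTrace(); }
   }
}).start();

// workers: numCores - 2 of them, converting chunks in parallel
int workers = Math.max(1, Runtime.getRuntime().availableProcessors() - 2);
ExecutorService pool = Executors.newFixedThreadPool(workers);
for (int i = 0; i < workers; i++) {
   pool.execute(new Runnable() {
      public void run() {
         try {
            while (true) {
               byte[] chunk = inQueue.take();
               outQueue.put(convert(chunk));       // convert() = your own transformation
            }
         } catch (InterruptedException e) { /* shutting down */ }
      }
   });
}

// consumer: the only thread that touches the output file
new Thread(new Runnable() {
   public void run() {
      try {
         RandomAccessFile out = new RandomAccessFile("output.dat", "rw");
         while (true) {
            out.write(outQueue.take());
         }
      } catch (Exception e) { e.printStackTrace(); }
   }
}).start();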

Nobody appreciates the fantastic work the OS does on memory mapped files? No way a ‘regular programmer’ can come up with something more efficient for random access.

The only disadvantages of MappedByteBuffer are that you get a region with a max length of Integer.MAX_VALUE, so you need an array of MappedByteBuffers to map files bigger than 2GB, and that there is no unmap(), which leaves you at the mercy of the GC to close the file handle (or messing with sun.misc.Unsafe to force it).

It’s just complex, finicky, etc. The normal random access file IO stuff will work absolutely splendidly for this work load.

Cas :slight_smile:

True, but performance has rarely been associated with elegance.

This is how you map a 200GB file into memory:

      // needs: java.io.File, java.io.RandomAccessFile, java.nio.MappedByteBuffer,
      // java.nio.channels.FileChannel, java.nio.channels.FileChannel.MapMode,
      // java.util.ArrayList, java.util.List

      // 200GB
      long len = 200L * 1024 * 1024 * 1024;
      File file = new File("C:\\huge.dat");

      RandomAccessFile raf = new RandomAccessFile(file, "rw");
      raf.setLength(len);
      FileChannel chan = raf.getChannel();

      long t0 = System.currentTimeMillis();

      List<MappedByteBuffer> maps = new ArrayList<MappedByteBuffer>();

      long off = 0;
      while (off < len)
      {
         // each mapping is capped at Integer.MAX_VALUE bytes
         long chunk = Math.min(len - off, Integer.MAX_VALUE);
         MappedByteBuffer map = chan.map(MapMode.READ_WRITE, off, chunk);
         off += map.capacity();
         maps.add(map);
      }
      raf.close(); // the mappings stay valid after the channel is closed

      long t1 = System.currentTimeMillis();

      System.out.println("took: " + (t1 - t0) + "ms");

On my mediocre system it takes ~250ms.

Memory mapping is fantastic, but given that gouessej's requirements are to linearly process a huge input file and produce an output file, I'd rather go for the threaded approach and let a single thread chew through the file while the processing is distributed over the spare cores.

Of course gouessej has been horribly vague about exactly what he's trying to do; if he genuinely does need random access from multiple threads then I agree that memory mapping is the way to go.

It would then be a particularly unusual problem though, wouldn't it? I bet your suggestion of one reader, n processors and one writer, using ordinary file access, will be the simplest and most efficient solution here.

Cas :slight_smile:

Excellent suggestion :smiley: thank you very much. I think we were trying to create MappedByteBuffer instances that were too big; that is why it was not working.