I’m experimenting with the impact of file layout on disk read performance and have achieved some good results, but I can’t pinpoint why one alternative is better than the other.
My data structure consists of a rectangular grid of tiles. Each tile has multiple resolution versions (1024^2, 512^2, 256^2, …). I’ve produced a test dataset of 64x64 tiles with a highest level resolution of 1024^2. This dataset has then been written to disk in two different formats A and B.
File format A creates one file per resolution level. In each file all tiles are written linearized by row.
File 1: Tile1@1024^2, Tile2@1024^2, Tile3@1024^2, …
File 2: Tile1@512^2, Tile2@512^2, Tile3@512^2, …
…
File format B creates multiple files with some aribtrary size limit for each file. Each file contains some of the tiles including all of their levels.
File 1: Tile1@1024^2, Tile1@512^2, …, Tile1@1^2, Tile2@1024^2, Tile2@512^2, …, Tile2@1^2, …
File 2: Tile10@1024^2, Tile10@512^2, …, Tile10@1^2, Tile11@1024^2, Tile11@512^2, …, Tile11@1^2, …
…
In practice this gives me for format A a 16Gb file for the first level, a 4Gb file for the next, … Format B has 64 files of 340Mb each. To see which format performs best I wrote the following bit of benchmark code:
public void readTiles(FileFormat format, byte[][] buffer, int levelCount) throws IOException {
for (int y = 0; y < GRID_HEIGHT; y++) {
for (int x = 0; x < GRID_WIDTH; x++) {
for (int level = 0; level < levelCount; level++) {
int fileIndex = format.getFileForTile(x, y, level);
long offset = format.getOffsetForTile(x, y, level);
RandomAccessFile file = getFile(fileIndex);
file.seek(offset);
int bytesRead = 0;
int bytesToRead = buffer[level].length;
while (bytesToRead > 0) {
int r = file.read(buffer[level], bytesRead, bytesToRead);
if (r == -1) {
throw new EOFException();
} else {
bytesToRead -= r;
bytesRead += r;
}
}
}
}
}
}
After running this test with the two formats I get the following results:
Format A: ~554ms per tile
Format B: ~106ms per tile
Of course the above results are good news, but I can’t seem to figure out why one is so much faster than the other. My intuition tells me this is related to data locality and seek times, but I would like to get some slightly more scientific proof than just my intuition. Profiling with a java profiler shows the read call is the bottleneck in both cases, so that’s not of much use. Does anyone know how I can determine what’s actually going on behind the read call so I can figure out exactly why one is slower than the other?