Vertex cache shenanigans 2 - WTF, Nvidia???

This is a follow-up to this thread.

Now that my exam is over I’ve taken a bit of time to look into this stuff again.

I wrote a simple vertex cache emulator that, given an index list, computes the number of cache misses that would occur if the list were used to render something. It has tweakable cache and batch sizes, so it should work decently, except possibly for the weird results on old-ish AMD cards. However, when comparing the emulator against the number of cache misses reported by pipeline statistics, I noticed discrepancies, sometimes massive ones. I quickly narrowed it down to the replacement policy: which older entry gets overwritten when the cache is full.

I tried numerous methods:

  • A FIFO cache: the element that was ADDED to the cache the longest time ago was overwritten by the new index.
  • Last-use cache: the element that was USED the longest time ago was overwritten by the new index.
  • CPU cache-ish: the cache element to overwrite was indexed with (indexValue%cachesize).
  • CPU cache-ish 2: the cache element to overwrite was indexed with (indexElement%cachesize).
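
For reference, here is a minimal Python sketch of how the four replacement policies listed above could be emulated. The function names are made up for illustration, batch size is ignored, and reading indexElement as the position of the index within the index list is an assumption on my part.

```python
from collections import OrderedDict


def count_misses_fifo_lru(indices, cache_size, lru=False):
    """Count misses for a FIFO cache, or a last-use (LRU) cache if lru=True."""
    cache = OrderedDict()  # keys are cached indices, oldest first
    misses = 0
    for idx in indices:
        if idx in cache:
            if lru:
                cache.move_to_end(idx)  # a hit refreshes the entry only under LRU
            continue
        misses += 1
        if len(cache) >= cache_size:
            cache.popitem(last=False)  # evict the oldest entry
        cache[idx] = True
    return misses


def count_misses_modulo(indices, cache_size, by_position=False):
    """Count misses when the slot to overwrite is picked with a modulo."""
    slots = [None] * cache_size
    misses = 0
    for position, idx in enumerate(indices):
        if idx in slots:
            continue
        misses += 1
        # Assumption: "indexValue" is the index itself, "indexElement" is
        # its position in the index list.
        key = position if by_position else idx
        slots[key % cache_size] = idx
    return misses
```

With a 2-entry cache and the index list [0, 1, 0, 2, 0], the FIFO variant reports 4 misses and the last-use variant 3, which shows how much the replacement policy alone can move the count.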

All of these systems were incorrect in some cases. Hence I developed a small “algorithm” for checking which values were in the hardware cache:

  1. Draw N triangles of 3 indices each and check how many cache misses occur using pipeline statistics.
  2. For each possible index I, draw the original N triangles plus the triangle (I, I, I) (same index 3 times). If the number of cache misses is the same, I was in the cache.
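
The two steps above can be sketched roughly like this in Python. The count_misses callback is hypothetical: on real hardware it would draw the index list and read the miss count back from a pipeline statistics query. Here a software FIFO cache (fifo_misses, also made up) stands in for the GPU so the sketch runs on its own. It assumes every call starts from an empty cache.

```python
def probe_cache(triangles, candidates, count_misses):
    """Return the candidate indices that appear cached after drawing `triangles`.

    `count_misses` is a hypothetical callback that takes a flat index list,
    draws it, and returns the number of cache misses.
    """
    baseline = count_misses(triangles)
    cached = []
    for i in candidates:
        # Append the degenerate triangle (i, i, i); an unchanged miss count
        # means i was already in the cache.
        if count_misses(triangles + [i, i, i]) == baseline:
            cached.append(i)
    return cached


def fifo_misses(indices, cache_size=32):
    """Software stand-in for the GPU: a plain FIFO cache."""
    cache, misses = [], 0
    for idx in indices:
        if idx not in cache:
            misses += 1
            if len(cache) >= cache_size:
                cache.pop(0)  # drop the entry that was added first
            cache.append(idx)
    return misses


# Which of the indices 0..39 are cached after drawing 10 triangles?
cached = probe_cache(list(range(30)), range(40), fifo_misses)
```

Note that this costs one extra draw per candidate index, so it is only practical for small index ranges.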

My GPU has a cache size of 32 entries. Drawing 10 triangles (30 vertices) with the indices (0, 1, 2, …, 29) makes the cache results look like this:


GPU:      [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29]
Emulated: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, -1, -1]

What happens when we draw 11 triangles with 33 different indices in total?


GPU:      [30, 31, 32]
Emulated: [32, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31]

Wow. So basically, if the number of cache misses in a triangle (0 to 3) is more than the number of free entries left in the cache, the entire cache is cleared and then the new entries are inserted. It doesn’t even f**king try to overwrite an old index. It just starts a new cache. Wow. Just wow.
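
A minimal Python sketch of that flush-on-overflow behavior, as far as I can infer it from the two experiments above. The function name is made up, and what happens to indices of the overflowing triangle that were hits before the flush is a guess: here they are fetched again.

```python
def emulate_flushing_cache(indices, cache_size=32):
    """Return (misses, cache contents) under the flush-on-overflow guess."""
    cache = []
    misses = 0
    for start in range(0, len(indices), 3):
        triangle = indices[start:start + 3]
        # Distinct indices of this triangle that are not cached yet (0 to 3).
        new = []
        for idx in triangle:
            if idx not in cache and idx not in new:
                new.append(idx)
        if len(cache) + len(new) > cache_size:
            # Assumption: the whole cache is dropped, so every distinct index
            # of this triangle is fetched again, not only the new ones.
            cache = []
            new = list(dict.fromkeys(triangle))
        misses += len(new)
        cache.extend(new)
    return misses, cache
```

Feeding it the indices 0 to 29 leaves all 30 of them cached, and feeding it 0 to 32 leaves only [30, 31, 32], which matches the two GPU rows above.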

EDIT: This just in: It DOES clear the vertex cache. If a vertex isn’t touched for 16 triangles (48 indices), it is cleared from the cache. This is getting hard as hell to code.
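
Here is a rough Python sketch of the expiry rule on its own. The function name is made up, capacity handling is left out on purpose, and the exact off-by-one (whether an entry dies on the 16th or the 17th untouched triangle) is a guess.

```python
def count_misses_with_expiry(indices, max_age=16):
    """Count misses when entries untouched for `max_age` triangles are dropped.

    Capacity handling is deliberately left out to isolate the expiry rule.
    """
    last_touched = {}  # cached index -> number of the triangle that last used it
    misses = 0
    for tri, start in enumerate(range(0, len(indices), 3)):
        # Assumption: an entry survives exactly `max_age` triangles without use.
        last_touched = {
            idx: t for idx, t in last_touched.items() if tri - t <= max_age
        }
        for idx in indices[start:start + 3]:
            if idx not in last_touched:
                misses += 1
            last_touched[idx] = tri
    return misses
```

In this sketch, an index used by triangle 0 still hits if triangle 16 uses it again, but misses if the next use is in triangle 17.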