Vertex cache shenanigans 2 - WTF, Nvidia???

This is a follow-up of this thread.

Now that my exam is over I’ve taken a bit of time to look into this stuff again.

I wrote a simple vertex cache emulator that when given an index list can compute the number of cache misses that would occur if the list was used to render something. It has tweakable cache and batch sizes, so it should work decently, minus possibly the weird results of old-ish AMD cards. However, when comparing the emulator against the number of cache misses reported by using pipeline statistics I noticed discrepancies, sometimes massive ones. I quickly narrowed it down to which older entry to overwrite when the cache is full.

I tried numerous methods:

  • A FIFO cache: the element that was ADDED the longest time ago to the cache was overwritten by the new index.
  • Last-use cache: the element that was USED the longest time ago to the cache was overwritten by the new index.
  • CPU cache-ish: the cache element to overwrite was indexed with (indexValue%cachesize).
  • CPU cache-ish 2: the cache element to overwrite was indexed with (indexElement%cachesize).

All of these systems were incorrect in some case. Hence I developed a small “algorithm” for checking which values were in the hardware cache:

  1. Draw N triangles of 3 indices each and check how many cache misses occurs using pipeline statistics.
  2. For each possible index I, draw the original N triangles plus the triangle (I, I, I) (same index 3 times). If the number of cache misses is the same, I was in the cache.

My GPU has a cache size of 32 entries. Drawing 10 triangles (30 vertices) with the indices (0, 1, 2, …, 29) makes the cache results look like this:


GPU:      [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29]
Emulated: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, -1, -1]

What happens when we draw 11 triangles with 33 different indices in total?


GPU:      [30, 31, 32]
Emulated: [32, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31]

Wow. So basically, if <number of cache misses in triangle (0 to 3)> is more than what would fit in the cache, the entire cache is cleared, then the new entries are inserted. It doesn’t even f**king try to overwrite an old index. It just starts a new cache. Wow. Just wow.

EDIT: This just in: It DOES clear the vertex cache. If a vertex isn’t touched for 16 triangles (48 indices), it is cleared from the cache. This is getting hard as hell to code.

Ooooook… So there doesn’t actually seem to be a batch size at all. In other words, Nvidia doesn’t process 96 indices per block, then clears the cache between each block. Basically, forget everything I’ve said so far. Here is the pattern I’ve found:

  • Nvidia’s cache size is 32.

  • Writing is done from the first element in the cache to the last element in order.

  • If the cache misses caused by a triangle (0 to 3) would cause it to write beyond the end of the cache, the entire cache is cleared. All unique vertices become cache misses and are then placed as elements 0 to (unique_verts - 1) in the cleared cache.
    - Example 1: If the cache is full and all 3 vertices of the next triangle are in the cache, it will not reset the cache.
    - Example 2: If only one free element is left in the cache and a triangle with 2 misses is drawn, then the cache is reset, all 3 vertices become cache misses and are placed in the cache.
    - Example 3: If the cache is full, contains vertex 0 and the degenerate triangle (0, 0, 1) is drawn, then the cache is cleared, 0 and 1 become cache misses and are placed in the cache.

  • Cache entries that are not used by any triangle become invalid after 16 triangles (48 indices) after their last use.

  • All cache entries have a maximum lifetime of 32 triangles (96 indices) after they were originally placed in the cache, no matter how many times they are used within that time.

  • Note that when a cache entry becomes invalid, it is not possible to write a new vertex to it until the entire cache is reset due to overflow. The entries simply are not usable for matching anymore, making the cache effectively smaller until reset.

  • The lifetimes of the vertices in the cache is based on the number of vertices of the geometry being drawn. For points, the entries become invalid after 16 points (16 indices) after their most recent use, or 32 points (32 indices) after they were first used. For lines, it’s 16/32 lines (32/64 indices). For triangles, it’s 16/32 triangles (48/96 indices).

Examples

If you draw 100 identical degenerate (5, 5, 5) triangles, 5 will be placed in the first slot (slot 0) of the cache. After 32 triangles, slot 0 becomes invalid. For the 33rd triangle, slot 0 will be unusable and 5 will again be placed in the cache, this time in slot 1. After 32*32=1024 triangles, the entire cache has slowly been filled with 5s that have gone invalid, and the last slot finally becomes invalid. When the 1025th triangle is drawn, the entire cache is reset and 5 is placed in slot 0 again.

If you draw the triangle (0, 1, 2), then repeatedly draw (0, 0, 0) triangles, the entries for 1 and 2 will become invalid after only 16 triangles as they are not used. This 16-triangle counter is reset each time the vertex is reused in a triangle, but regardless of if they’re used or not they will become invalid after 32 triangles. In other words, 1 and 2 become invalid by the 17th triangle, and 0 becomes invalid by the 32th triangle.

Following all these rules I seem to be able to exactly simulate the results of every single index list I’ve tested so far.

Do you have any insight as to why they implemented it this way? ???

I’d say that vertex-caches are never overwritten, because the GPU is this async machine that will be reading that vertex data long after the indexed geometry is processed by the vertex-shader. Allocating a new vertex-cache upon filling one seems almost the only option…?