Vertex cache shenanigans

The results are identical on each run.

-ClaasJG

ClaasJG, can you try this jar out: https://drive.google.com/open?id=0B0dJlB1tP0QZUDk0QW1xSHNsRXM. It has an increased test size to hopefully give more accurate results, but it may take some time to complete. Thanks a lot for testing!!! This is extremely interesting!

Batch size test invocations: 130056 / 50331648
Calculated vertex cache batch size: 387

Cache size 1 invocation test: 130056 / 50331648
Cache size 2 invocation test: 260112 / 50331648
Cache size 3 invocation test: 390168 / 50331648
Cache size 4 invocation test: 520224 / 50331648
Cache size 5 invocation test: 650280 / 50331648
Cache size 6 invocation test: 780336 / 50331648
Cache size 7 invocation test: 910392 / 50331648
Cache size 8 invocation test: 1040448 / 50331648
Cache size 9 invocation test: 1170504 / 50331648
Cache size 10 invocation test: 1300560 / 50331648
Cache size 11 invocation test: 1430616 / 50331648
Cache size 12 invocation test: 1560672 / 50331648
Cache size 13 invocation test: 1690728 / 50331648
Cache size 14 invocation test: 1820784 / 50331648
Cache size 15 invocation test: 6331592 / 50331648
Cache size 16 invocation test: 11184834 / 50331648
Cache size 17 invocation test: 49414172 / 50331648
Cache size 18 invocation test: 49938444 / 50331648
Cache size 19 invocation test: 49610774 / 50331648
Cache size 20 invocation test: 49545240 / 50331648
Cache size 21 invocation test: 49741842 / 50331648
Cache size 22 invocation test: 50200580 / 50331648
Cache size 23 invocation test: 49414172 / 50331648
Cache size 24 invocation test: 50331648 / 50331648

Results:
  Renderer: AMD Radeon HD 7800 Series
  Calculated vertex cache batch size: 387
  Cache size: 23

here we go :)

-ClaasJG

Integrated Graphics of i7 4720HQ:


Error: Pipeline statistics are not supported. Aborting.
  Renderer: Intel(R) HD Graphics 4600

GTX 980M:


Batch size test invocations: 32768 / 3145728
Calculated vertex cache batch size: 96

Cache size 1 invocation test: 32768 / 3145728
Cache size 2 invocation test: 65536 / 3145728
Cache size 3 invocation test: 98304 / 3145728
Cache size 4 invocation test: 131072 / 3145728
Cache size 5 invocation test: 163840 / 3145728
Cache size 6 invocation test: 196608 / 3145728
Cache size 7 invocation test: 229376 / 3145728
Cache size 8 invocation test: 262144 / 3145728
Cache size 9 invocation test: 294912 / 3145728
Cache size 10 invocation test: 327680 / 3145728
Cache size 11 invocation test: 360448 / 3145728
Cache size 12 invocation test: 393216 / 3145728
Cache size 13 invocation test: 425984 / 3145728
Cache size 14 invocation test: 458752 / 3145728
Cache size 15 invocation test: 491520 / 3145728
Cache size 16 invocation test: 524288 / 3145728
Cache size 17 invocation test: 557056 / 3145728
Cache size 18 invocation test: 589824 / 3145728
Cache size 19 invocation test: 622592 / 3145728
Cache size 20 invocation test: 655360 / 3145728
Cache size 21 invocation test: 688128 / 3145728
Cache size 22 invocation test: 720896 / 3145728
Cache size 23 invocation test: 753664 / 3145728
Cache size 24 invocation test: 786432 / 3145728
Cache size 25 invocation test: 819200 / 3145728
Cache size 26 invocation test: 851968 / 3145728
Cache size 27 invocation test: 884736 / 3145728
Cache size 28 invocation test: 917504 / 3145728
Cache size 29 invocation test: 950272 / 3145728
Cache size 30 invocation test: 983040 / 3145728
Cache size 31 invocation test: 1015808 / 3145728
Cache size 32 invocation test: 1048576 / 3145728
Cache size 33 invocation test: 3145728 / 3145728

Results:
  Renderer: GeForce GTX 980M/PCIe/SSE2
  Calculated vertex cache batch size: 96
  Cache size: 32

I don’t have an AMD 8)

Some errors but seemed to run on an AMD APU

Vertex shader log:
0:1(10): error: GLSL 3.30 is not supported. Supported versions are: 1.10, 1.20, 1.30, 1.00 ES, and 3.00 ES

Fragment shader log:
0:1(10): error: GLSL 3.30 is not supported. Supported versions are: 1.10, 1.20, 1.30, 1.00 ES, and 3.00 ES

Link log:
error: linking with uncompiled shadererror: linking with uncompiled shader
Batch size test invocations: 8129 / 3145728
Calculated vertex cache batch size: 387

Cache size 1 invocation test: 8129 / 3145728
Cache size 2 invocation test: 16258 / 3145728
Cache size 3 invocation test: 24387 / 3145728
Cache size 4 invocation test: 32516 / 3145728
Cache size 5 invocation test: 40645 / 3145728
Cache size 6 invocation test: 48774 / 3145728
Cache size 7 invocation test: 56903 / 3145728
Cache size 8 invocation test: 65032 / 3145728
Cache size 9 invocation test: 73161 / 3145728
Cache size 10 invocation test: 81290 / 3145728
Cache size 11 invocation test: 89419 / 3145728
Cache size 12 invocation test: 97548 / 3145728
Cache size 13 invocation test: 105677 / 3145728
Cache size 14 invocation test: 113806 / 3145728
Cache size 15 invocation test: 398303 / 3145728
Cache size 16 invocation test: 699062 / 3145728
Cache size 17 invocation test: 3145728 / 3145728

Results:
  Renderer: Gallium 0.4 on AMD KAVERI (DRM 2.43.0, LLVM 3.8.0)
  Calculated vertex cache batch size: 387
  Cache size: 16
Batch size test invocations: 130056 / 50331648
Calculated vertex cache batch size: 387

Cache size 1 invocation test: 130056 / 50331648
Cache size 2 invocation test: 260112 / 50331648
Cache size 3 invocation test: 390168 / 50331648
Cache size 4 invocation test: 520224 / 50331648
Cache size 5 invocation test: 650280 / 50331648
Cache size 6 invocation test: 780336 / 50331648
Cache size 7 invocation test: 910392 / 50331648
Cache size 8 invocation test: 1040448 / 50331648
Cache size 9 invocation test: 1170504 / 50331648
Cache size 10 invocation test: 1300560 / 50331648
Cache size 11 invocation test: 1430616 / 50331648
Cache size 12 invocation test: 1560672 / 50331648
Cache size 13 invocation test: 1690728 / 50331648
Cache size 14 invocation test: 1820784 / 50331648
Cache size 15 invocation test: 6372742 / 50331648
Cache size 16 invocation test: 11184822 / 50331648
Cache size 17 invocation test: 50331648 / 50331648

Results:
  Renderer: AMD Radeon HD 5800 Series
  Calculated vertex cache batch size: 387
  Cache size: 16

AMD HD 5870 1GB. (Crimson 16.2.1)

Some results from Intel on Linux


Vertex shader log:
0:1(10): error: GLSL 3.30 is not supported. Supported versions are: 1.10, 1.20, 1.30, 1.00 ES, 3.00 ES, and 3.10 ES

Fragment shader log:
0:1(10): error: GLSL 3.30 is not supported. Supported versions are: 1.10, 1.20, 1.30, 1.00 ES, 3.00 ES, and 3.10 ES

Link log:
error: linking with uncompiled shadererror: linking with uncompiled shader
Batch size test invocations: 1 / 3145728
Calculated vertex cache batch size: 3145728

Cache size 1 invocation test: 1 / 3145728
Cache size 2 invocation test: 2 / 3145728
Cache size 3 invocation test: 3 / 3145728
Cache size 4 invocation test: 4 / 3145728
Cache size 5 invocation test: 5 / 3145728
Cache size 6 invocation test: 6 / 3145728
Cache size 7 invocation test: 7 / 3145728
Cache size 8 invocation test: 8 / 3145728
Cache size 9 invocation test: 9 / 3145728
Cache size 10 invocation test: 10 / 3145728
Cache size 11 invocation test: 11 / 3145728
Cache size 12 invocation test: 12 / 3145728
Cache size 13 invocation test: 13 / 3145728
Cache size 14 invocation test: 14 / 3145728
Cache size 15 invocation test: 15 / 3145728
Cache size 16 invocation test: 16 / 3145728
Cache size 17 invocation test: 17 / 3145728
Cache size 18 invocation test: 18 / 3145728
Cache size 19 invocation test: 19 / 3145728
Cache size 20 invocation test: 20 / 3145728
Cache size 21 invocation test: 21 / 3145728
Cache size 22 invocation test: 22 / 3145728
Cache size 23 invocation test: 23 / 3145728
Cache size 24 invocation test: 24 / 3145728
Cache size 25 invocation test: 25 / 3145728
Cache size 26 invocation test: 26 / 3145728
Cache size 27 invocation test: 27 / 3145728
Cache size 28 invocation test: 28 / 3145728
Cache size 29 invocation test: 29 / 3145728
Cache size 30 invocation test: 30 / 3145728
Cache size 31 invocation test: 31 / 3145728
Cache size 32 invocation test: 32 / 3145728
Cache size 33 invocation test: 33 / 3145728
Cache size 34 invocation test: 34 / 3145728
Cache size 35 invocation test: 35 / 3145728
Cache size 36 invocation test: 36 / 3145728
Cache size 37 invocation test: 37 / 3145728
Cache size 38 invocation test: 38 / 3145728
Cache size 39 invocation test: 39 / 3145728
Cache size 40 invocation test: 40 / 3145728
Cache size 41 invocation test: 41 / 3145728
Cache size 42 invocation test: 42 / 3145728
Cache size 43 invocation test: 43 / 3145728
Cache size 44 invocation test: 44 / 3145728
Cache size 45 invocation test: 45 / 3145728
Cache size 46 invocation test: 46 / 3145728
Cache size 47 invocation test: 47 / 3145728
Cache size 48 invocation test: 48 / 3145728
Cache size 49 invocation test: 49 / 3145728
Cache size 50 invocation test: 50 / 3145728
Cache size 51 invocation test: 51 / 3145728
Cache size 52 invocation test: 52 / 3145728
Cache size 53 invocation test: 53 / 3145728
Cache size 54 invocation test: 54 / 3145728
Cache size 55 invocation test: 55 / 3145728
Cache size 56 invocation test: 56 / 3145728
Cache size 57 invocation test: 57 / 3145728
Cache size 58 invocation test: 58 / 3145728
Cache size 59 invocation test: 59 / 3145728
Cache size 60 invocation test: 60 / 3145728
Cache size 61 invocation test: 61 / 3145728
Cache size 62 invocation test: 62 / 3145728
Cache size 63 invocation test: 63 / 3145728
Cache size 64 invocation test: 64 / 3145728
Cache size 65 invocation test: 65 / 3145728
Cache size 66 invocation test: 66 / 3145728
Cache size 67 invocation test: 67 / 3145728
Cache size 68 invocation test: 68 / 3145728
Cache size 69 invocation test: 69 / 3145728
Cache size 70 invocation test: 70 / 3145728
Cache size 71 invocation test: 71 / 3145728
Cache size 72 invocation test: 72 / 3145728
Cache size 73 invocation test: 73 / 3145728
Cache size 74 invocation test: 74 / 3145728
Cache size 75 invocation test: 75 / 3145728
Cache size 76 invocation test: 76 / 3145728
Cache size 77 invocation test: 77 / 3145728
Cache size 78 invocation test: 78 / 3145728
Cache size 79 invocation test: 79 / 3145728
Cache size 80 invocation test: 80 / 3145728
Cache size 81 invocation test: 81 / 3145728
Cache size 82 invocation test: 82 / 3145728
Cache size 83 invocation test: 83 / 3145728
Cache size 84 invocation test: 84 / 3145728
Cache size 85 invocation test: 85 / 3145728
Cache size 86 invocation test: 86 / 3145728
Cache size 87 invocation test: 87 / 3145728
Cache size 88 invocation test: 88 / 3145728
Cache size 89 invocation test: 89 / 3145728
Cache size 90 invocation test: 90 / 3145728
Cache size 91 invocation test: 91 / 3145728
Cache size 92 invocation test: 92 / 3145728
Cache size 93 invocation test: 93 / 3145728
Cache size 94 invocation test: 94 / 3145728
Cache size 95 invocation test: 95 / 3145728
Cache size 96 invocation test: 96 / 3145728
Cache size 97 invocation test: 97 / 3145728
Cache size 98 invocation test: 98 / 3145728
Cache size 99 invocation test: 99 / 3145728
Cache size 100 invocation test: 100 / 3145728
Cache size 101 invocation test: 101 / 3145728
Cache size 102 invocation test: 102 / 3145728
Cache size 103 invocation test: 103 / 3145728
Cache size 104 invocation test: 104 / 3145728
Cache size 105 invocation test: 105 / 3145728
Cache size 106 invocation test: 106 / 3145728
Cache size 107 invocation test: 107 / 3145728
Cache size 108 invocation test: 108 / 3145728
Cache size 109 invocation test: 109 / 3145728
Cache size 110 invocation test: 110 / 3145728
Cache size 111 invocation test: 111 / 3145728
Cache size 112 invocation test: 112 / 3145728
Cache size 113 invocation test: 113 / 3145728
Cache size 114 invocation test: 114 / 3145728
Cache size 115 invocation test: 115 / 3145728
Cache size 116 invocation test: 116 / 3145728
Cache size 117 invocation test: 117 / 3145728
Cache size 118 invocation test: 118 / 3145728
Cache size 119 invocation test: 119 / 3145728
Cache size 120 invocation test: 120 / 3145728
Cache size 121 invocation test: 121 / 3145728
Cache size 122 invocation test: 122 / 3145728
Cache size 123 invocation test: 123 / 3145728
Cache size 124 invocation test: 124 / 3145728
Cache size 125 invocation test: 125 / 3145728
Cache size 126 invocation test: 126 / 3145728
Cache size 127 invocation test: 127 / 3145728
Cache size 128 invocation test: 128 / 3145728
Cache size 129 invocation test: 3145728 / 3145728

Results:
  Renderer: Mesa DRI Intel(R) Haswell Mobile 
  Calculated vertex cache batch size: 3145728
  Cache size: 128

http://www.joshbarczak.com/blog/?p=1231

Indeed, despite some shader errors here and there, the data gathered is really good. Thanks everyone!

The test has two parts. The first part draws a massive index buffer filled with 0s and checks how many times the vertex shader is executed. The second part tries to figure out the cache size by drawing a bigger and bigger repeated list of indices (0, 1, …, n, 0, 1, …, n, 0, 1, …, n, …), where n is increased by 1 between each test. At some point, this will start thrashing the cache: once the number of unique indices is bigger than the cache, vertex 0 has already been evicted by the time the list of indices repeats, so every single entry in the index buffer requires a new vertex shader execution.
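To make the thrashing argument concrete, here is a minimal Java sketch that runs the second test's index pattern against an idealized FIFO cache shared by the whole draw call. It is not the actual test program (that one counts real vertex shader invocations on the GPU); the class and method names are made up for illustration, and the 128-entry cache size is simply taken from the Intel log above.

```java
import java.util.ArrayDeque;
import java.util.HashSet;

// Hypothetical CPU-side model, not the GPU test itself.
class FifoVertexCacheSim {

    /** Builds the repeating pattern 0, 1, ..., uniqueCount-1, 0, 1, ... of the given length. */
    static int[] repeatingIndices(int length, int uniqueCount) {
        int[] indices = new int[length];
        for (int i = 0; i < length; i++) {
            indices[i] = i % uniqueCount;
        }
        return indices;
    }

    /** Counts cache misses (one vertex shader run each) for a single FIFO cache. */
    static long countInvocations(int[] indices, int cacheSize) {
        ArrayDeque<Integer> fifo = new ArrayDeque<>();  // eviction order, oldest first
        HashSet<Integer> resident = new HashSet<>();    // fast "is it cached?" lookup
        long invocations = 0;
        for (int index : indices) {
            if (resident.contains(index)) {
                continue; // cache hit, no shader run
            }
            invocations++;
            if (fifo.size() == cacheSize) {
                resident.remove(fifo.removeFirst()); // evict the oldest entry
            }
            fifo.addLast(index);
            resident.add(index);
        }
        return invocations;
    }

    public static void main(String[] args) {
        int length = 3 * 1024 * 1024;
        int cacheSize = 128; // assumption: value reported in the Intel log
        for (int unique : new int[] {1, 128, 129}) {
            long count = countInvocations(repeatingIndices(length, unique), cacheSize);
            System.out.println("unique=" + unique + " invocations=" + count + " / " + length);
        }
    }
}
```

Under this model a 3*1024*1024-entry buffer needs 1 invocation when it is all zeros, 128 invocations with 128 unique indices, and 3145728 with 129, which is the same pattern as in the Mesa Intel log.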

Let’s go through the results:

Intel seems to be the most straightforward, and literally the only vertex cache that actually works as expected. The GPU loops through the index list and keeps a 128-entry FIFO vertex cache. When drawing an index buffer of length 3*1024*1024 filled completely with 0s, it only runs the vertex shader once, then never again. When the number of unique indices exceeds the size of the vertex cache, thrashing occurs and every single index needs a vertex shader execution, which is exactly what I had predicted based on “public knowledge” of the vertex cache. This is what people optimize meshes for.

Nvidia’s solution is more complicated. Even if you render an index buffer filled with 0s, the vertex shader will be executed more than once. What is happening here is that the GPU is splitting up the index buffer into chunks of 32 primitives, which in the case of triangles is 96 indices. For lines it’s 64 and for points it’s 32. This is what I call the “batch size” in the test results. There seems to be a separate vertex cache for each batch, so even if the index buffer contains only zeroes the vertex shader will be executed once per batch. This severely limits the usefulness of the cache: reuse only works within the same 96-index block, which greatly increases the chance of having to run the vertex shader multiple times for the same vertex. In addition, there is a 32-entry FIFO cache within each block as well, so it’s still possible to overflow the cache within a block if it contains more than 32 unique indices. Most likely, this choice was made by Nvidia to allow for more parallelism in hardware, as it allows each 96-index block to be processed completely independently. Intel needs to go through the entire index buffer linearly.
This has major implications for how a mesh should be optimized, as the mesh optimizer needs to be aware of the 96-index blocks to be able to make the best decisions. Otherwise it may assume that a vertex will be reused for two triangles, but the triangles may turn out to be in different 96-index blocks, so the vertex won’t be in the cache there.
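The batched behaviour can be modelled the same way. The sketch below is only a guess at what the hardware does, based on the numbers in the logs: it assumes the index buffer is cut into fixed-size batches and that each batch starts with an empty FIFO cache. The class name, method name and parameters are hypothetical.

```java
// Hypothetical CPU-side model of a per-batch FIFO vertex cache.
class BatchedVertexCacheSim {

    /**
     * Counts vertex shader runs for the pattern 0, 1, ..., uniqueCount-1, 0, 1, ...
     * when the buffer is split into batches of batchSize indices and every batch
     * has its own FIFO cache of cacheSize entries.
     */
    static long countInvocations(int indexCount, int uniqueCount, int batchSize, int cacheSize) {
        long invocations = 0;
        int[] fifo = new int[cacheSize]; // circular buffer holding the cached indices
        for (int start = 0; start < indexCount; start += batchSize) {
            int end = Math.min(start + batchSize, indexCount);
            int used = 0; // entries currently cached; the cache is empty at each batch start
            int next = 0; // slot to overwrite next (the oldest entry once the cache is full)
            for (int i = start; i < end; i++) {
                int index = i % uniqueCount;
                boolean hit = false;
                for (int k = 0; k < used; k++) {
                    if (fifo[k] == index) {
                        hit = true;
                        break;
                    }
                }
                if (hit) {
                    continue;
                }
                invocations++;
                fifo[next] = index;
                next = (next + 1) % cacheSize;
                if (used < cacheSize) {
                    used++;
                }
            }
        }
        return invocations;
    }

    public static void main(String[] args) {
        int indexCount = 3 * 1024 * 1024;
        // Assumptions taken from the GTX 980M log: 96-index batches, 32-entry cache.
        for (int unique : new int[] {1, 32, 33}) {
            long count = countInvocations(indexCount, unique, 96, 32);
            System.out.println("unique=" + unique + " invocations=" + count + " / " + indexCount);
        }
    }
}
```

With a batch size of 96 and a cache size of 32 this model gives 32768 invocations for the all-zero buffer, 1048576 for 32 unique indices and 3145728 for 33, which matches the GTX 980M results. A mesh optimizer could use the same kind of model as its cost function instead of a single global FIFO.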

AMD’s technique is… very weird. It seems similar to Nvidia’s solution, but the results don’t perfectly match that. The calculated batch size is 387, which is 384+3, which is 3*32*4+3, so the batch size seems to be roughly 4x as big as for Nvidia. That’s a pretty uneven number that I really wasn’t expecting. Most likely, the actual batch size is 384, with some additional weird behavior in there. As for the actual cache size within each batch, it’s most likely 16 for both the HD7800 and the KAVERI APU, but the results are again inconsistent. In addition, the results are off by one between the two (8130 vs 8129 invocations). =___= There’s definitely something fishy and complicated going on here. To get anything conclusive that would actually be useful information for a mesh optimizer, I’d need to run more tests. I don’t really have a guess for why the batch size seems so random, but the discrepancy of the HD7800 not being completely cache thrashed at 16-23 entries could be explained by the GPU updating the cache in small batches (most likely 8) instead of one by one. This would explain why the GPU kiiiinda manages to do at least some caching up to 24 entries. There could also be some ordering weirdness here as well.
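For reference, the batch size arithmetic can be reproduced from the logged counts. The sketch below assumes the test program derives the batch size by dividing the index count by the invocation count of the all-zero draw and rounding to the nearest integer; that rounding is my assumption (it is consistent with the logged values), and the names are hypothetical.

```java
// Hypothetical helper; the rounding rule is an assumption, not taken from the test program.
class BatchSizeArithmetic {

    /** Estimated batch size: indices drawn per vertex shader run in the all-zero test. */
    static long estimateBatchSize(long indexCount, long zeroBufferInvocations) {
        return Math.round((double) indexCount / zeroBufferInvocations);
    }

    /** Invocations the all-zero test would need if batches were exactly batchSize indices. */
    static long expectedInvocations(long indexCount, long batchSize) {
        return (indexCount + batchSize - 1) / batchSize; // ceiling division
    }

    public static void main(String[] args) {
        // Counts copied from the logs earlier in the thread.
        System.out.println("HD 7800:  " + estimateBatchSize(50331648L, 130056L));
        System.out.println("KAVERI:   " + estimateBatchSize(3145728L, 8129L));
        System.out.println("GTX 980M: " + estimateBatchSize(3145728L, 32768L));
        // What a clean 384-index batch would have produced on the large test.
        System.out.println("384-index batches: " + expectedInvocations(50331648L, 384L));
    }
}
```

This prints 387 for both AMD logs and 96 for the GTX 980M. A clean 384-index batch would need 131072 invocations on the large test, whereas the HD 7800 reports 130056, so its average batch comes out slightly larger than 384.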

We really need to do more testing on AMD hardware. If any of ClaasJG, Jono or Abuse have time for it, I’d love it if we could continue testing a bit using IRC or Skype to be able to do some more rapid iterations of the test program. Feel free to either PM me or respond in this thread if any of you are interested!

Thanks a lot for all the help, guys!

Radeon HD 290X for the sake of completeness:


Batch size test invocations: 131072 / 50331648
Calculated vertex cache batch size: 384

Cache size 1 invocation test: 131072 / 50331648
Cache size 2 invocation test: 262144 / 50331648
Cache size 3 invocation test: 393216 / 50331648
Cache size 4 invocation test: 524288 / 50331648
Cache size 5 invocation test: 655360 / 50331648
Cache size 6 invocation test: 786432 / 50331648
Cache size 7 invocation test: 917504 / 50331648
Cache size 8 invocation test: 1048576 / 50331648
Cache size 9 invocation test: 1179648 / 50331648
Cache size 10 invocation test: 1310720 / 50331648
Cache size 11 invocation test: 1441792 / 50331648
Cache size 12 invocation test: 1572864 / 50331648
Cache size 13 invocation test: 1703936 / 50331648
Cache size 14 invocation test: 1835008 / 50331648
Cache size 15 invocation test: 6422528 / 50331648
Cache size 16 invocation test: 11927552 / 50331648
Cache size 17 invocation test: 50331648 / 50331648

Results:
  Renderer: AMD Radeon R9 200 Series
  Calculated vertex cache batch size: 384
  Cache size: 16


Oooooof couuuurse. The newer 200 series turns out to be exactly 384. Wooh. The plot thickens. Well, I guess that partly confirms my 387 --> 384 hypothesis at least.

I’ve messaged and added Jono and ClaasJG on Skype, but I haven’t gotten any responses yet. If anyone with an AMD card has time to do some vertex cache testing I’d really appreciate the help! It all basically amounts to setting up the program I’ve posted here in an IDE, adding LWJGL3 as a dependency, then modifying it a bit to do some more fine-grained testing on when exactly the values change. My ultimate goal would be to write a per-vendor mesh optimizer!

Batch size test invocations: 32768 / 3145728
Calculated vertex cache batch size: 96

Cache size 1 invocation test: 32768 / 3145728
Cache size 2 invocation test: 65536 / 3145728
Cache size 3 invocation test: 98304 / 3145728
Cache size 4 invocation test: 131072 / 3145728
Cache size 5 invocation test: 163840 / 3145728
Cache size 6 invocation test: 196608 / 3145728
Cache size 7 invocation test: 229376 / 3145728
Cache size 8 invocation test: 262144 / 3145728
Cache size 9 invocation test: 294912 / 3145728
Cache size 10 invocation test: 327690 / 3145728
Cache size 11 invocation test: 360448 / 3145728
Cache size 12 invocation test: 393216 / 3145728
Cache size 13 invocation test: 425984 / 3145728
Cache size 14 invocation test: 458752 / 3145728
Cache size 15 invocation test: 491520 / 3145728
Cache size 16 invocation test: 524288 / 3145728
Cache size 17 invocation test: 557056 / 3145728
Cache size 18 invocation test: 589824 / 3145728
Cache size 19 invocation test: 622592 / 3145728
Cache size 20 invocation test: 655360 / 3145728
Cache size 21 invocation test: 688128 / 3145728
Cache size 22 invocation test: 720918 / 3145728
Cache size 23 invocation test: 753664 / 3145728
Cache size 24 invocation test: 786432 / 3145728
Cache size 25 invocation test: 819200 / 3145728
Cache size 26 invocation test: 851968 / 3145728
Cache size 27 invocation test: 884736 / 3145728
Cache size 28 invocation test: 917504 / 3145728
Cache size 29 invocation test: 950272 / 3145728
Cache size 30 invocation test: 983040 / 3145728
Cache size 31 invocation test: 1015808 / 3145728
Cache size 32 invocation test: 1048606 / 3145728
Cache size 33 invocation test: 3145728 / 3145728

Results:
Renderer: GeForce GTX 1070/PCIe/SSE2
Calculated vertex cache batch size: 96
Cache size: 32