Vulkan multi-GPU support is out! + other features


Vulkan multi-GPU support is now out as an experimental KHX extension to Vulkan called KHX_device_group!

EDIT2: It IS possible to synchronize memory between GPUs in a device group! See below!

I was expecting Vulkan’s multi-GPU support to be extremely complicated, forcing you to essentially manage two completely different Vulkan devices and replicate commands to both separate GPUs, but this is simply not at all the case. Vulkan’s multi-GPU support seems to be very similar to DirectX 12’s. Basically, multiple GPUs from the same vendor (usually connected with an SLI/Crossfire bridge for direct data transfer between the GPUs) are reported as a device group, which is still exposed as a single logical device to the user. In other words, your code will look almost identical, with just a few extra calls to direct commands to different devices, making adding multi-GPU support significantly easier, to the point where I’d even say it’s trivial.

When allocating device memory, you’re now able to tell the driver exactly which devices should actually allocate memory. In other words, you do not need to have the same memory allocated on all devices. This can lead to some reductions in memory usage, but in all likelihood it’s not a significant gain. You’re still most likely going to want to upload your textures to all devices and allocate your render targets on all of them.

There are a couple of features for controlling which device does what. A number of functions take in a device mask, which controls which devices are supposed to actually execute the commands. Additionally, when recording a command buffer there is a new command, vkCmdSetDeviceMaskKHX(), that is used to control which GPUs are to execute the following commands.
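Device masks are just bitmasks with one bit per device in the group. Here’s a tiny C sketch of the mask math (the vkCmdSetDeviceMaskKHX calls are only shown in comments, and mask_all/mask_device are my own helper names, not part of the extension):

```c
#include <stdint.h>

/* One bit per physical device in the group: bit i = device i. */
static uint32_t mask_all(uint32_t deviceCount) {
    return (deviceCount >= 32u) ? ~0u : ((1u << deviceCount) - 1u);
}

static uint32_t mask_device(uint32_t deviceIndex) {
    return 1u << deviceIndex;
}

/* While recording a command buffer, you'd steer work roughly like:
 *   vkCmdSetDeviceMaskKHX(cmd, mask_all(2));    // following commands run on both GPUs
 *   vkCmdSetDeviceMaskKHX(cmd, mask_device(1)); // following commands run on GPU 1 only
 */
```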

A big part of the additions is related to presenting images from a device group. Again, this seems to be mostly automated and I haven’t exactly figured out how it’s supposed to work, but it seems to just be a change to image acquiring and presenting where you tell the driver which GPU should have access to which image. This part is rather uninteresting; presenting just got even more complicated. =_=

From what I can tell, multi-GPU support is lacking a number of important features.

  • DirectX 12 is currently showing off asymmetrical device groups formed manually by the user, allowing you to do some pretty crazy stuff like combining a mobile Nvidia GPU with the integrated Intel GPU and splitting the workload between them. Vulkan is limited to the device groups the driver exposes, which in practice means two identical GPUs for Nvidia, or two similar GPUs from the same generation for AMD.

  • There doesn’t seem to be a way to manually copy a texture or buffer from one of the devices to another in a device group. o_O This is basically the critical feature that is the whole point of manual multi-GPU support, as it’s the foundation of everything. Without it, the fancy new pipelined multi-GPU technique that people are talking about is completely impossible to implement. Hell, even split-frame rendering (SFR), checkerboard rendering and non-trivial alternate frame rendering (AFR) are impossible without the ability to manually trigger copies. However, there seems to be a presentation mode that sums up the colors of all the images of all devices. My guess is that this is supposed to be used with split-frame rendering, allowing each device to render part of the frame (when starting a render pass, you can give each device its own sub-rectangle to render to) and the presentation engine to then merge the result. This is however extremely limiting, as manually synchronizing resources for SFR is necessary for a large number of effects, like bloom, SSAO, etc.
    EDIT: A separate extension called EXT_discard_rectangles allows the user to define a number of rectangles that rendering is clipped against. In addition, these rectangles can be set per device in a device group, allowing for checkerboard multi-GPU rendering, but without manual synchronization this would again break any postprocessing effect that requires neighboring pixels.

All in all, I’m positively surprised by the simplicity of the system they’ve chosen, but also disappointed, primarily by the seeming lack of manual synchronization control, which renders the whole thing kinda useless. =/ The ability to manually start an asynchronous copy using the transfer engine over the SLI/Crossfire bridge is the entire point of manual multi-GPU support; that’s the work the driver teams at Nvidia and AMD have been doing by hand for a decade now, leading to buggy, hacky solutions that either don’t work at all or never got implemented in a huge number of AAA games, let alone indie games. Since DirectX 12 seems to support this, I can only assume that the lack of this feature is the reason the extension is classified as experimental, and hence that it will be updated to include this before actual release.

Still, I am very happy to get reliable information on how multi-GPU support is going to work in Vulkan, to the point where I can confidently continue on my abstraction layer without fear of having to rewrite it.

EDIT2: There IS support for synchronizing memory between GPUs! It is called “peer memory”. The flow seems to be something like this:

  • In a device group with multiple discrete GPUs, the device local GPU heap flags will have the VK_MEMORY_HEAP_MULTI_INSTANCE_BIT_KHX flag set. This indicates that data allocated from this heap by default is replicated on all GPUs in the group.
  • Peer memory features can be queried per device group, heap type and device pair (local + remote). The device group must support copying to and from peer memory (=copying between GPUs), but can also support even generic access to memory on other GPUs (=any access, for example texture reads, SSBO reads, etc)!
  • When allocating memory, the VK_MEMORY_ALLOCATE_DEVICE_MASK_BIT_KHX flag bit can be passed in together with a device mask, meaning that memory is only allocated for certain devices in the device group. However, if the subsetAllocation property is not available to the device group, the allocation may consume memory from all devices regardless of which devices are selected using the device mask. Regardless, this means that each device in the device group can have its own instance of memory.
  • When binding memory to a buffer or image, you can additionally specify which instance of memory is bound for which device in the device group. This allows you to essentially create two buffer objects that read from the same memory location, but from different GPUs. Confused yet?
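As a sketch, the query step might look like this in C. The enum below is a local stand-in for the extension’s VK_PEER_MEMORY_FEATURE_* bits so the snippet is self-contained, and needs_staging_copy is my own helper name:

```c
#include <stdbool.h>
#include <stdint.h>

/* Local stand-ins for the extension's peer memory feature bits
 * (VK_PEER_MEMORY_FEATURE_COPY_SRC_BIT_KHX and friends). */
enum {
    PEER_COPY_SRC    = 0x1, /* can copy FROM the remote instance                  */
    PEER_COPY_DST    = 0x2, /* can copy TO the remote instance (spec-guaranteed)  */
    PEER_GENERIC_SRC = 0x4, /* can read it directly (texture/SSBO reads, etc.)    */
    PEER_GENERIC_DST = 0x8, /* can write it directly                              */
};

/* 'features' would come from a call along the lines of:
 *   vkGetDeviceGroupPeerMemoryFeaturesKHX(device, heapIndex,
 *       localDeviceIndex, remoteDeviceIndex, &features);
 * If the remote instance can't be read directly, we need the
 * copy-based flow from the AFR walkthrough. */
static bool needs_staging_copy(uint32_t features) {
    return (features & PEER_GENERIC_SRC) == 0;
}
```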

Let’s say you have a device group with four GPUs and you want to render using AFR (Alternate Frame Rendering), meaning that each GPU renders every fourth frame completely on its own. However, you realize that to render a frame you need access to the previous color buffer for some temporal anti-aliasing you’re doing. In other words, each GPU needs to read an image from the previous GPU. Here’s what you’d do to initialize the whole thing:

  1. Query the device group properties for how you can share your memory. Let’s say that only COPY_DST access is available (the only one required to be available by the spec), meaning that we need to copy the image from one GPU to another to be able to sample it like a texture/image.
  2. We allocate the memory for the image we want to synchronize as normal, just making sure that the heap we’re allocating from has the VK_MEMORY_HEAP_MULTI_INSTANCE_BIT_KHX bit set. This means that each GPU will allocate its own instance of the image’s memory. Let’s call this “memory 1”.
  3. To be able to copy the image from one GPU to another and to simplify the synchronization, we also need to have another identical image allocated; as our device group doesn’t support direct access to other GPUs’ memory, this extra image is needed. Its memory is allocated exactly like the previous one. Let’s call this “memory 2”.
  4. We create a Vulkan image (“image 1”) and bind it as usual to memory 1. This image object will be used when each GPU draws its own frame.
  5. We create a Vulkan image (“image 2”) and bind it to memory 2 the same way. This image object will hold the previous image copied from the previous GPU.
  6. We create a third Vulkan image object (“image 3”), also bound to memory 2, but to the next device’s instance of that memory. This is done by passing in a device index list which makes each device read from a different instance of the memory. In the previous two memory bindings, we (implicitly) told each of the four devices to use instances {0, 1, 2, 3} of the memory, which simply means that device 0 uses instance 0, device 1 uses instance 1. In other words, each device uses its own local instance. For this third binding, we’re going to tell the devices to bind to the “next” device’s instance by passing in device indices {1, 2, 3, 0}. This gives device 0 access to device 1’s instance, device 1 access to device 2’s instance, etc. This means that each device can use this weird third image object to copy to the next GPU’s memory.
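The index arrays from step 6 are the easy part to get wrong, so here they are written out in plain C. This only models the pDeviceIndices arrays you’d hand to the binding call; local_instances and next_instances are my own names:

```c
#include <stdint.h>

#define GPU_COUNT 4

/* Bindings for images 1 and 2: each device uses its own memory instance,
 * i.e. pDeviceIndices = {0, 1, 2, 3}. */
static void local_instances(uint32_t indices[GPU_COUNT]) {
    for (uint32_t d = 0; d < GPU_COUNT; ++d)
        indices[d] = d;
}

/* Binding for image 3: each device binds the NEXT device's instance,
 * i.e. pDeviceIndices = {1, 2, 3, 0}, so a copy into image 3 on device d
 * actually lands in device (d + 1) % 4's memory. */
static void next_instances(uint32_t indices[GPU_COUNT]) {
    for (uint32_t d = 0; d < GPU_COUNT; ++d)
        indices[d] = (d + 1u) % GPU_COUNT;
}
```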

Now, the actual synchronization process is fairly complicated.

  1. We direct all our rendering commands to device 0 for the first frame. We have no previous frame yet so we simply ignore it.
  2. Device 0 renders to image 1 (in other words, it renders to its own instance of memory 1). We attach a semaphore (“semaphore 1”) which is signaled when the rendering to the image is completed.
  3. Still on device 0, we go to the dedicated transfer queue and submit a command buffer containing a copy from image 1 to image 3 (in other words, a copy from device 0’s current image to device 1’s previous image). We tell this command buffer to await semaphore 1 (so we don’t start transferring before rendering is complete) and tell it to signal another semaphore (“semaphore 2”) when the copy is complete.
  4. Device 0 continues with some extra postprocessing, finishes up its frame and presents its result to the window.
    We’re now gonna start submitting commands for Device 1.
  5. Device 1 starts rendering its own frame to its own instance of image 1 until we reach the point where we need access to the previous frame.
  6. For this part, we submit a command buffer and tell it to await semaphore 2. Once semaphore 2 is signaled, the copy from device 0 to device 1 is complete, and device 1 can access the previous frame from device 0 using image 2!
  7. Go to step 3 and repeat (but for device 1 instead, then 2, then 3, then 0, etc).
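The device round-robin in that loop boils down to modular arithmetic. Here’s that part isolated in C; the actual queue submissions and semaphores are only sketched in the comment, and the function names are mine:

```c
#include <stdint.h>

#define GPU_COUNT 4

/* In AFR, frame N is rendered by device N % GPU_COUNT... */
static uint32_t render_device(uint32_t frame) {
    return frame % GPU_COUNT;
}

/* ...and its color buffer must end up on the device that renders frame
 * N + 1, which is exactly where image 3's "next instance" binding points. */
static uint32_t copy_target_device(uint32_t frame) {
    return (frame + 1u) % GPU_COUNT;
}

/* Per frame N, roughly:
 *   1. render on render_device(N), signal semaphore 1 when done
 *   2. transfer queue: copy image 1 -> image 3 (lands in
 *      copy_target_device(N)'s memory), wait sem 1, signal sem 2
 *   3. render_device(N + 1) waits on sem 2 before sampling image 2
 */
```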

Phew! What a mess! But it should work! It’s actually not that complicated in practice… xd

Additions from KHR_descriptor_update_template:

  • Allows you to create a descriptor update template, which describes a fixed mapping from a block of your own host memory to the descriptors in a set. The use case is when you need to apply the same kind of update to a large number of descriptor sets. You create the template once, then call vkUpdateDescriptorSetWithTemplateKHR() with a descriptor set and a raw pointer to your data, and the driver applies every write described by the template in that one call. This is faster than a manual vkUpdateDescriptorSets() with one VkWriteDescriptorSet struct per write.
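The key mechanism is that each template entry records an (offset, stride) pair into a host data blob of your choosing. A plain-C model of that lookup follows; DrawData, TemplateEntry and read_descriptor are my own illustrative names, and the real entry struct also carries binding/array/type info:

```c
#include <stddef.h>
#include <stdint.h>

/* Your own per-object data, laid out however you like. */
typedef struct {
    uint64_t uniformBuffer; /* stand-in for a VkDescriptorBufferInfo */
    uint64_t texture;       /* stand-in for a VkDescriptorImageInfo  */
} DrawData;

/* A template entry records where a descriptor's data lives in the blob. */
typedef struct {
    size_t offset; /* offset of element 0                   */
    size_t stride; /* distance between consecutive elements */
} TemplateEntry;

/* Conceptually what the driver does per descriptor when you call
 * vkUpdateDescriptorSetWithTemplateKHR(device, set, template, pData). */
static uint64_t read_descriptor(const void *pData,
                                TemplateEntry e, size_t i) {
    return *(const uint64_t *)((const char *)pData + e.offset + i * e.stride);
}
```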

Additions from KHR_push_descriptor:

  • Normally, the user allocates descriptor sets, binds them in command buffers and is then unable to modify a descriptor set until the command buffers that use it have completed execution. This extension allows the user to define descriptor set layouts that instead read their data from the command buffer, called push descriptors. When a command buffer begins recording, the push descriptors are all undefined, so the user needs to call vkCmdPushDescriptorSetKHR() to update all push descriptors used by the shader. From what I can tell, this is mostly a convenience feature that lets you avoid setting up a new descriptor set each frame for sets that change every frame; instead you just use push descriptors and inject the updates into the command buffer. Really nice feature, as it removes the need for per-frame tracking of descriptor sets that update each frame.

Additions from KHR_incremental_present:

  • Allows you to only present individual rectangles of the screen. This means that you can avoid redrawing the entire screen if only a small section of the screen needs to be redrawn. This is mostly a win on mobile where it can save a lot of battery.

Additions from KHR_maintenance1:

  • Copy/aliasing compatibility between 2D texture arrays and 3D textures.
  • Allowing negative viewport heights to make it possible to do a y-inversion of the scene. This is great for OGL compatibility!!!
  • Command pool trimming, to hint to the driver to release unused memory. Better than force releasing everything when clearing a command pool. Think ArrayList.trimToSize() for command pools.
  • Some error reporting improvements.
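The negative-viewport-height trick from the second bullet, sketched in C. Viewport here is a local struct mirroring VkViewport’s fields so the snippet stands alone, and flip_y is my own helper:

```c
/* Mirrors VkViewport's fields, so no Vulkan headers are needed here. */
typedef struct {
    float x, y, width, height, minDepth, maxDepth;
} Viewport;

/* With KHR_maintenance1, an OGL-style y-up viewport is expressed by
 * anchoring y at the BOTTOM edge and negating the height. */
static Viewport flip_y(Viewport vp) {
    vp.y      = vp.y + vp.height; /* origin moves to the bottom edge */
    vp.height = -vp.height;       /* negative height = y-inversion   */
    return vp;
}
```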