Sharing GPU Memory Between Processes Using cudaMemPool

Sharing GPU memory between processes sounds like a headache, but don’t freak out, because I’m here to break it down for you in the most casual way possible.

So let me ask you this: have you ever found yourself struggling with sharing data between multiple CUDA processes? You know, like when you want to pass a massive dataset from one process to another without having to copy everything over again? Well, my friend, I’ve got some good news for you.

Introducing cudaMemPool, the savior of your GPU memory woes! This nifty little feature lets you allocate and manage shareable GPU memory in a way that’s both efficient and easy to use. And let me tell ya, it’s about time we had something like this.

So how does it work? First you create a memory pool with cudaMemPoolCreate(), telling it which kind of OS shareable handle you want (a POSIX file descriptor on Linux, for example), and then you allocate from that pool with cudaMallocFromPoolAsync(). To hand the memory to another process, you export the pool with cudaMemPoolExportToShareableHandle() and each allocation with cudaMemPoolExportPointer(); the other process imports them with cudaMemPoolImportFromShareableHandle() and cudaMemPoolImportPointer(), and from then on both processes can touch the same GPU memory without copying anything.
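To make that concrete, here’s a rough sketch of the exporting side, with all error checking left out. The send_fd_to_peer() and send_bytes_to_peer() helpers are hypothetical stand-ins for whatever IPC channel (say, a Unix domain socket) you use to move the handle and the export data across, so treat this as an outline of the flow rather than production code:

```c++
#include <cuda_runtime.h>
#include <cstddef>

// Hypothetical helpers: however you move a file descriptor and a blob of
// bytes to the other process (e.g. over a Unix domain socket) is up to you.
void send_fd_to_peer(int fd);
void send_bytes_to_peer(const void *data, std::size_t len);

// Exporting side: create a shareable pool, allocate from it, and export both
// the pool and the allocation so another process can import them.
void share_buffer(std::size_t numBytes, cudaStream_t stream) {
  // Describe a device memory pool whose allocations can be exported as a
  // POSIX file descriptor (the Linux flavor of OS shareable handle).
  cudaMemPoolProps props = {};
  props.allocType     = cudaMemAllocationTypePinned;
  props.handleTypes   = cudaMemHandleTypePosixFileDescriptor;
  props.location.type = cudaMemLocationTypeDevice;
  props.location.id   = 0;  // device 0

  cudaMemPool_t pool;
  cudaMemPoolCreate(&pool, &props);

  // Allocate from the pool in stream order.
  void *devPtr = nullptr;
  cudaMallocFromPoolAsync(&devPtr, numBytes, pool, stream);

  // Export the pool itself as an OS handle (a file descriptor here)...
  int fd = -1;
  cudaMemPoolExportToShareableHandle(&fd, pool,
                                     cudaMemHandleTypePosixFileDescriptor, 0);

  // ...and export this particular allocation as an opaque blob.
  cudaMemPoolPtrExportData exportData;
  cudaMemPoolExportPointer(&exportData, devPtr);

  // Ship both pieces to the other process, which then calls
  // cudaMemPoolImportFromShareableHandle() with the fd and
  // cudaMemPoolImportPointer() with the blob to get a device pointer
  // to the very same memory.
  send_fd_to_peer(fd);
  send_bytes_to_peer(&exportData, sizeof(exportData));
}
```

The nice part of this design is that the file descriptor only has to be shipped once per pool; after that, every allocation exported from the pool travels as a small cudaMemPoolPtrExportData blob.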

Now let’s switch gears to Cooperative Groups, another relatively recent addition to the CUDA programming model. Cooperative Groups lets you synchronize groups of threads smaller than a thread block, as well as groups that span an entire kernel launch running on one or even multiple GPUs. That flexibility comes in handy for complex GPU algorithms where a plain __syncthreads() just doesn’t cut it.
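To give you a taste of the “smaller than a thread block” part, here’s a little sketch of my own (not one of the official samples) that carves each block into 16-thread tiles; every tile reduces its own values and synchronizes on its own, without dragging the rest of the block along:

```c++
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// Toy kernel: partition each thread block into 16-thread tiles and let each
// tile compute its own partial sum. Assumes blockDim.x is a multiple of 16,
// in has one element per thread, and out has one slot per tile.
__global__ void tile_sum(const int *in, int *out) {
  // Handle to the whole thread block (the scope __syncthreads() covers).
  cg::thread_block block = cg::this_thread_block();

  // Statically partition the block into tiles of 16 threads each.
  cg::thread_block_tile<16> tile = cg::tiled_partition<16>(block);

  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  int val = in[idx];

  // Tile-wide tree reduction using shuffles; only these 16 threads take part.
  for (int offset = tile.size() / 2; offset > 0; offset /= 2) {
    val += tile.shfl_down(val, offset);
  }

  // Synchronize only the 16 threads of this tile; other tiles are unaffected.
  tile.sync();

  // Lane 0 of each tile writes out its partial sum.
  if (tile.thread_rank() == 0) {
    out[idx / 16] = val;
  }
}
```

Each tile behaves like a miniature thread block, with its own thread_rank(), its own sync(), and its own shuffle and vote operations, which is exactly the flexibility a block-wide __syncthreads() can’t give you.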

Thread groups that span an entire kernel launch (or even multiple GPUs) rely on new features in Pascal and Volta GPUs, and we plan to cover them in a follow-up post.

In terms of examples, the CUDA Toolkit (version 9 and higher) ships several samples that use Cooperative Groups. One handy technique is warp-aggregated atomics, which you can use as a drop-in replacement for atomicAdd() to cut the number of atomic operations by up to the number of threads in a warp (up to 32x on current GPUs). That can dramatically improve performance when lots of threads are hammering the same counter, as in filtering or stream compaction.

To implement this technique with Cooperative Groups, we use the coalesced_group type, which groups together the threads of a warp that are currently active and ranks them within that group. Here’s an example implementation:

```c++
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// Warp-aggregated atomic increment: all currently active threads in the warp
// share a single atomicAdd(), and each thread gets back its own unique index
// (the counter's old value plus the thread's rank within the group).
__device__ int atomicAggInc(int *ptr) {
  // Group together the threads of the warp that are currently active.
  cg::coalesced_group g = cg::coalesced_threads();
  int prev;

  // Elect the first active thread in the group to perform the one atomic add,
  // bumping the counter by the whole group's size in a single operation.
  if (g.thread_rank() == 0) {
    prev = atomicAdd(ptr, g.size());
  }

  // Broadcast the old counter value from the elected thread to the rest of
  // the group, then give each thread its own slot by adding its rank.
  prev = g.thread_rank() + g.shfl(prev, 0);
  return prev;
}
```

This implementation elects the first active thread in the group with a simple check on its rank within the coalesced group. That thread performs the single atomicAdd(), the old counter value is broadcast to the rest of the group with the group’s shfl() method, and each thread then adds its own rank to claim a unique slot.
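And just so you can picture where you’d drop it in, here’s a sketch of a typical filtering (stream compaction) kernel; the kernel name and the predicate are purely illustrative. Every thread whose element passes the test grabs a unique output slot through atomicAggInc() instead of issuing its own atomicAdd():

```c++
// Copy the elements of src that pass the predicate into dst, handing out
// output slots with one atomic per warp. The caller zero-initializes *counter.
__global__ void filter_k(int *dst, int *counter, const int *src, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i >= n) return;

  // Only threads whose element passes the test reach the call below, so the
  // coalesced group inside atomicAggInc() contains exactly those threads.
  if (src[i] > 0) {
    dst[atomicAggInc(counter)] = src[i];
  }
}
```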

In terms of resources, the Cooperative Groups section of the CUDA C++ Programming Guide and NVIDIA’s developer blog post “Cooperative Groups: Flexible CUDA Thread Programming” cover the API in far more detail, and the CUDA Toolkit samples include several Cooperative Groups examples worth poking at.

In terms of future developments, that follow-up post will dig into the grid_group and multi_grid_group types. Stay tuned!

SICORPS