GPU Driven

· June 26, 2025

note-to-self

Traditionally, GPU collective network operations were issued from the framework on a separate CUDA stream than the local computation kernel launches. This allowed overlapping comms and hiding most or all of the network latency. NCCL exposes collectives as fully implemented kernels, and there have been various derivitives such as AMD’s RCCL or Berkeley’s new UCCLproject, which is aiming to be a drop-in replacement better suited for large-scale GPU workloads¹. Earlier versions of this sent the networking via the host, and later developed towards GPU-to-GPU peer to peer over connections like NVlink, and direct communication between GPU and the NIC for scale out. But the actual coordination was driven by launches from the CPU.

The increasing capacity of GPUs, particularly in the Hopper/MI300x era, the use of micro-batching, and the rise of Mixture-of-Expert models put a lot more pressure on this arrangement: now rather than a large chunk of comms at the end of each distributed-data-parallel pass you are doing thousands of small exchanges per step. Each step could require each CPU rank to launch thousands of tiny All-to-All: millions of collective calls across the cluster, while still running the rest of the training loop. Each one forces a host interrupt, collective calls and GPU triggers for the data transfer.

A paper from last year, The Landscape of GPU Driven Communication, gives a broad overview of this shift:

In the last decade, several advancements, broadly referred to as GPU-centric communication, have sought to challenge the CPU’s hegemony on multi-GPU execution. At a high level, these advancements reduce the CPU’s involvement in the critical path of execution, give the GPU more autonomy in initiating and synchronizing communication and attempt to address the semantic mismatch between multi-GPU communication and computation.

Getting the right kind of abstractions in here to make this more accessible and flexible is an active area of development. The main reference is the shmem abstraction. This allows writing and reading bytes from remote memory (originally for supercomputers) and adding barriers around usage. Nvidia’s nvshmem (and AMDs ROCshmem) library directly implemented this for GPU.

The next big step up in functionality was GPU Direct Async, a transport that allows access to NIC doorbells directly from the GPU, extending NVSHMEM to RoCE/InfiniBand (Remote Direct Memory Access – RDMA). In addition to this, NVLS (NVLink Sharp) was added to NVLink switches which allowed much more bandwidth efficient switch-managed multicast for broadcast and reduce cases, the fundamental operations used in all-gather and all-reduce collectives. This allows GPU-initiation of collectives which gives us the ingredients to get the CPU out of the networking path completely, and to fuse networking alongside compute operations.

This was one of the things DeepSeek did really well, covered in their DeepEP work: fusing MoE GEMM with GPU-initiated RDMA cut single-node latency by 2–3x. The tradeoff is kernels handle a lot of complexity: choosing between NVLink and GPUDirect for nodes, polling, flow management and so on is tuned to their specific needs and hardware.

ByteDance has also spent a lot of time looking at this problem. Their Fluxpaper looks at more general approach for doing tile-based fusion of the comms and compute:

Flux overdecomposes computation and communication into tiles. Here, since the computation operation is GEMM, and most high-performance GEMM kernels on GPUs are written with tiling, such as thread block tiling or warp tiling, our decomposition can naturally map into existing tiling in the kernels. Flux fuses dependent communication and/or wait logic into a GEMM kernel, and launches only one fused kernel, compared to the prior methods launching multiple split GEMM kernels.

PyTorch has an experimental feature and RFC for SymmetricMemory:

Then, innovative block-wise compute/computation overlapping techniques started to use copy-engine to drive the P2P traffic in order to minimize contention. Now, we are seeing techniques where NVLink communication is directly issued from “compute” kernels.

[…]

Just as Triton allows average users to modify matmuls for their needs (fusion, quantization, etc.), we hope that SymmetricMemory will enable average users to modify NVLink communication algorithms for their requirements, whether it’s implementing alternate collective algorithms (one-shot allreduce), using different quantization approaches (stochastic rounding), or fusing collectives with other kernels (all-gather interleaved with matmuls).

This project is still fairly manual, though it abstracts much of the plumbing required and makes it accessible at a high level.

In a similar vein, ByteDance recently released their implementation of Triton-Distributed: ByteDance-Seed/Triton-distributed: Distributed Compiler Based on Triton for Parallel Systems. This adds a small Triton DSL wrapping the shmem operations, supporting both Nvidia and AMD hardware. It exposes some of the comms knobs to autotuning, allowing tuning across compute and collectives, focusing on the tile-based approach they documented in their TileLink paper.

This is unlikely to make its way into upstream Triton, in part because the general approach of Triton is to hide much of the hardware information in the compiler passes, while this approach makes the topology fairly explicit.

The PyTorch team has been working on traceable collectives for the PyTorch compiler. The idea of a compiler being able to look at the overall task graph and decide where to fuse comms and which transport options to use is appealing, as it allows kernels to be more transparent to the specifics of the cluster.

The move to GPU-initiated, fine-grain comms that fuse into compute kernels is real and continuing; the tooling is still early, but the gap will continue to close.

This currently host-side controlled, but they have plans for GPU-driven comms as well ↩