The GPU Mode Discord has emerged as the preeminent hub for current and aspiring GPU kernel hackers, and several people there have kicked off a project to make it easier to write and benchmark kernels. https://gpu-mode.github.io/discord-cluster-manager/docs/intro/ goes over the idea: it’s a series of leaderboards and runners for different kernel types, so you can easily find (and beat!) the state of the art:
We designed this leaderboard as a central and open-source resource for people to find the fastest kernels for the devices they are using. Furthermore, these open-community kernels will be useful in the future for designing automated methods for optimized kernel generation.
The latter part is one of the interesting points. Fundamentally, custom kernels are an optimization on a model architecture, and like any optimization it’s natural to look for a system that applies it automatically. ML compilers do a good job of certain graph optimizations, autotuning (searching for good kernel choices), and building specific versions from templates, but those templates are generally based on hand-written, high-performance kernels for specific needs and shapes. It’s natural to ask how LLMs do with this problem, and up to now the answer has been “pretty mid”.
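To make "autotuning" concrete, here is a minimal sketch in plain Python. It is not how any particular ML compiler does it: the name `autotune` and the two toy dot-product candidates are hypothetical, and real autotuners time GPU kernels over tile sizes and launch configurations rather than Python functions. The core loop is the same idea, though: run every candidate on the target inputs and keep the fastest.

```python
import time


def autotune(candidates, args, repeats=5):
    """Time each candidate on the same inputs; return (fastest name, timings)."""
    timings = {}
    for name, fn in candidates.items():
        fn(*args)  # warm-up call so one-time setup cost is not measured
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            fn(*args)
            best = min(best, time.perf_counter() - start)
        timings[name] = best  # best-of-N reduces noise from other processes
    return min(timings, key=timings.get), timings


# Two hypothetical "kernel" variants that compute the same dot product.
def dot_loop(a, b):
    total = 0.0
    for i in range(len(a)):
        total += a[i] * b[i]
    return total


def dot_zip(a, b):
    return sum(x * y for x, y in zip(a, b))


if __name__ == "__main__":
    a = [float(i) for i in range(10_000)]
    b = [1.0] * 10_000
    winner, timings = autotune({"loop": dot_loop, "zip": dot_zip}, (a, b))
    print(winner, timings)
```

Which variant wins depends on the machine and the input shape, which is exactly why the choice is searched for rather than hard-coded.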
To that end, Sakana recently wrote about their efforts to build an agentic system that generates high-performance kernels from PyTorch model code: https://sakana.ai/ai-cuda-engineer/. It has generated a lot of kernels (17k!).
They chose to output CUDA, rather than CUTLASS, Triton, or another higher-level framework, and they use an LLM to functionalize the PyTorch code, rather than using torch.compile and working on the exported graph:
Functional Conversion: We first evaluate the LLMs’ ability to convert torch modules into parameterized function calls (stage 1). Our analysis of 250 KernelBench tasks (fig. 6) reveals distinct performance patterns across complexity levels. All tested LLMs successfully generate equivalent functional implementations for basic operations and simple fused operations (level 1, 2). However, for complex composed architectures (level 3), reasoning models (o1-high, o1-preview, o3-mini-high) demonstrate superior robustness, converting more than 45 tasks compared to sonnet3.5’s 42 tasks.
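As a rough illustration of what "converting a module into a parameterized function call" means, here is a dependency-free analogue in plain Python. It deliberately avoids PyTorch so it runs anywhere, and the names `ScaledLinear` and `scaled_linear_fn` are made up for this sketch; it is not taken from Sakana's pipeline. The point is only the shape of the transformation: state that was hidden on the object becomes explicit arguments, which gives a clean signature to swap a custom kernel in behind.

```python
class ScaledLinear:
    """Module style: parameters live on the object as hidden state."""

    def __init__(self, weight, bias, scale):
        self.weight = weight  # list of rows, shape [out_features][in_features]
        self.bias = bias      # list of length out_features
        self.scale = scale    # scalar multiplier applied to the output

    def forward(self, x):
        out = []
        for row, b in zip(self.weight, self.bias):
            acc = sum(w * xi for w, xi in zip(row, x))
            out.append(self.scale * (acc + b))
        return out


def scaled_linear_fn(x, weight, bias, scale):
    """Functional style: the same math, with every parameter passed in."""
    out = []
    for row, b in zip(weight, bias):
        acc = sum(w * xi for w, xi in zip(row, x))
        out.append(scale * (acc + b))
    return out


def is_equivalent(module, x):
    """Check that the functional form reproduces the module's output."""
    functional = scaled_linear_fn(x, module.weight, module.bias, module.scale)
    return functional == module.forward(x)
```

An equivalence check like `is_equivalent` is the natural gate for this stage: a conversion only counts if the functional version matches the original module on the same inputs.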
One nice trick is that the system improves its own generation by including examples from similar, previously generated kernels, which raised the success rate:
Retrieval-Augmented CUDA Kernel Translation & Optimization: Building upon these results, we enhanced our system’s capabilities through RAG. By leveraging our growing ’innovation archive’ of translated and optimized kernels, RAG significantly improved both translation and optimization capabilities.
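The retrieval step can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the archive layout (dicts with `task` and `kernel` keys), the token-overlap (Jaccard) similarity, and the prompt format are all hypothetical, and a production system would more likely use embedding similarity. It shows the general pattern of pulling the most similar past kernels into the prompt as examples.

```python
import re


def tokenize(source):
    """Split source text into a set of lowercase identifier tokens."""
    return set(re.findall(r"[A-Za-z_]\w*", source.lower()))


def jaccard(a, b):
    """Overlap between two token sets, from 0.0 (disjoint) to 1.0 (equal)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def retrieve_examples(task_source, archive, k=2):
    """Return up to k archive entries whose task text is most similar."""
    query = tokenize(task_source)
    scored = [
        (jaccard(query, tokenize(entry["task"])), index, entry)
        for index, entry in enumerate(archive)
    ]
    # Highest similarity first; ties keep archive order.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [entry for score, _, entry in scored[:k] if score > 0]


def build_prompt(task_source, archive, k=2):
    """Assemble a prompt with retrieved kernels as worked examples."""
    parts = []
    for entry in retrieve_examples(task_source, archive, k):
        parts.append("Example task:\n" + entry["task"])
        parts.append("Example kernel:\n" + entry["kernel"])
    parts.append("New task:\n" + task_source)
    return "\n\n".join(parts)
```

Because each successful kernel is added back to the archive, later tasks get better examples, which is where the self-improvement comes from.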
