Quack CuteDSL Kernels

July 18, 2025

Dao-AILab/quack: A Quirky Assortment of CuTe Kernels

Tri Dao & co have a fun repo up called Quack: A Quirky Assortment of CuTe Kernels, all written in the CuTe-DSL. These are Hopper- and Blackwell-oriented kernels for common needs like softmax, LayerNorm and RMSNorm.

On top of that, they wrote a post on how to get speed-of-light (memory-bound) kernels in CuTe-DSL. It walks through implementing a reduction across the levels of the memory hierarchy: TensorSSA for thread-level reductions, shuffle_sync_bfly for warp-level reductions, and shared memory for block-level reductions. Even if you’re not writing CuTe, this is about as good an introduction to architecting memory-bound ops as I have seen!
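To make the three tiers concrete, here is a minimal plain-Python simulation of the idea. It is not CuTe-DSL code and none of it comes from the Quack repo: the function names, the 4-warp block and the 4 elements per thread are all assumptions for illustration. The butterfly step mimics what an XOR shuffle does, with each lane adding the value held by lane `i ^ offset`.

```python
WARP_SIZE = 32  # lanes per warp on NVIDIA GPUs


def thread_reduce(values):
    """Tier 1: each thread sums the elements it holds in registers."""
    return sum(values)


def warp_reduce_butterfly(lanes):
    """Tier 2: butterfly (XOR) reduction across the 32 lanes of a warp.

    After log2(32) = 5 steps every lane holds the warp-wide total.
    """
    assert len(lanes) == WARP_SIZE
    lanes = list(lanes)
    offset = WARP_SIZE // 2
    while offset >= 1:
        # Every lane adds the value held by its partner lane (i XOR offset).
        lanes = [lanes[i] + lanes[i ^ offset] for i in range(WARP_SIZE)]
        offset //= 2
    return lanes


def block_reduce(row, num_warps=4, elems_per_thread=4):
    """Tier 3: one value per warp goes to 'shared memory', then is summed."""
    threads = num_warps * WARP_SIZE
    assert len(row) == threads * elems_per_thread

    # Tier 1: per-thread partial sums (hypothetical contiguous layout).
    per_thread = [
        thread_reduce(row[t * elems_per_thread:(t + 1) * elems_per_thread])
        for t in range(threads)
    ]

    # Tier 2: each warp reduces its 32 partials; lane 0 writes the result.
    smem = []
    for w in range(num_warps):
        lanes = warp_reduce_butterfly(
            per_thread[w * WARP_SIZE:(w + 1) * WARP_SIZE]
        )
        smem.append(lanes[0])

    # Tier 3: combine the per-warp values held in the shared buffer.
    return sum(smem)
```

Calling `block_reduce(list(range(512)))` gives the same answer as `sum(range(512))`. The point of the sketch is only the shape of the computation: registers first, then lanes within a warp, then warps within a block.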

They also cover cluster reduction, where the thread blocks in a cluster cooperate across multiple SMs:

In cluster reduction, we first send the current warp’s reduced value to all the peer thread block’s reduction buffer in peer’s SMEM. Such sending is conducted via a dedicated SM-to-SM fabric (as DSMEM). Then each warp fetches all warp’s values from their local reduction buffer, and reduces these values.
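The two steps in that quote can be sketched as a small plain-Python simulation. This is an illustration under assumptions, not the Quack implementation: `cluster_reduce`, the nested-list layout and the slot numbering are all hypothetical, and ordinary list writes stand in for the DSMEM stores.

```python
def cluster_reduce(warp_partials):
    """Simulate cluster reduction over already warp-reduced values.

    warp_partials[b][w] is the value reduced by warp w of thread block b.
    Returns the total as seen by each block after the exchange.
    """
    num_blocks = len(warp_partials)
    num_warps = len(warp_partials[0])

    # One reduction buffer per block, with a slot for every warp in the cluster.
    buffers = [[0] * (num_blocks * num_warps) for _ in range(num_blocks)]

    # Step 1: every warp sends its value to the same slot in each peer's buffer
    # (on hardware this would be a write into the peer block's shared memory).
    for b in range(num_blocks):
        for w in range(num_warps):
            slot = b * num_warps + w
            for peer in range(num_blocks):
                buffers[peer][slot] = warp_partials[b][w]

    # Step 2: each block reads only its local buffer and reduces it.
    return [sum(buf) for buf in buffers]
```

For example, `cluster_reduce([[1, 2], [3, 4]])` leaves both blocks holding 10. Every block ends up with the full result without a round trip through global memory, which is what lets one row be spread over the registers and shared memory of several SMs in a cluster reduction.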

This does seem to help the kernels scale well to larger sizes:

We believe our outstanding performance at >= 65k input is due to our successful utilization of cluster reduction in H100. When the size of inputs are ultra long and depleting the SM’s registers and shared memory, without cluster reduction, we would have to switch to an online algorithm (like online softmax) otherwise we may get a massive register spilling that leads to significant throughput degradation.
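For reference, the online softmax that quote mentions as the fallback looks roughly like this. It is a minimal plain-Python sketch of the general technique, not code from the post; `online_softmax` and `chunk_size` are names chosen here for illustration. The idea is to stream over the input in chunks, keeping only a running maximum and a running sum, and to rescale the sum whenever the maximum changes.

```python
import math


def online_softmax(xs, chunk_size=4):
    """Softmax computed with a single streaming pass for the statistics."""
    running_max = float("-inf")
    running_sum = 0.0

    for start in range(0, len(xs), chunk_size):
        chunk = xs[start:start + chunk_size]
        new_max = max(running_max, max(chunk))
        # Rescale the old sum to the new maximum, then add this chunk's terms.
        running_sum = running_sum * math.exp(running_max - new_max) + sum(
            math.exp(x - new_max) for x in chunk
        )
        running_max = new_max

    # Second pass over the input to normalise with the final statistics.
    return [math.exp(x - running_max) / running_sum for x in xs]
```

Only two scalars are carried between chunks, so the whole row never has to sit in registers at once. The cost is the extra rescaling work and a second read of the input.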

I also really appreciate this note of reality in their conclusion:

Hitting “speed-of-light” model memory throughput confirms that a carefully hand-crafted CuTe kernel can squeeze every byte across all memory hierarchies in the hardware. But that efficiency comes at the price of per-operator and even per input-shape tuning, which imposes a natural tradeoff between efficiency and development efforts