Ping Pong GEMM from Scratch

March 18, 2025

links-and-recs

Following in the tradition of worked kernel examples, Bert, of the PyTorch and Triton teams at Meta, writes up his experience developing a fast Ping-Pong GEMM kernel from scratch with TMA (the Tensor Memory Accelerator, the hardware unit on Hopper/H100 for fast asynchronous copies between global and shared memory). As you might expect, there are some good insights from debugging and working through the problems.

You know what actually made it super obvious? Programming. I filled a shared memory buffer with consecutive integers — basically the smem equivalent of torch.arange(64*128).bfloat16().reshape(64, 128), and then TMA-transferred that to GMEM with 128B swizzling, cudaMemcpyed it back to the host, and printed it out. This actually made it crystal clear! I wrote the swizzle function correctly on my first try 😄.
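To get a feel for what that printout looks like without a GPU, here is a rough host-side sketch of the same experiment in plain Python. It is not Bert's code: it assumes the commonly described 128B swizzle pattern (the 16-byte chunk index within each 128-byte row is XORed with the row index modulo 8), and it uses a 64x64 bf16 tile so that each row is exactly 128 bytes, rather than the 64x128 tile in the quote. The names `swizzle_128b` and `swizzled_tile` are made up for illustration.

```python
# Sketch only: models the assumed 128B swizzle on byte offsets, no GPU needed.
ELEM_BYTES = 2  # bf16 element size


def swizzle_128b(byte_offset: int) -> int:
    """XOR the 16-byte chunk index (bits 4..6) with the 128-byte row index mod 8 (bits 7..9)."""
    return byte_offset ^ (((byte_offset >> 7) & 0x7) << 4)


def swizzled_tile(rows: int = 64, cols: int = 64):
    """Fill a tile with consecutive integers, then return the swizzled layout row by row."""
    n = rows * cols
    physical = [None] * n
    for logical in range(n):
        dst = swizzle_128b(logical * ELEM_BYTES)
        physical[dst // ELEM_BYTES] = logical
    return [physical[r * cols:(r + 1) * cols] for r in range(rows)]


if __name__ == "__main__":
    tile = swizzled_tile()
    for r in range(9):
        # Print the first element of each 8-element (16-byte) chunk in the row.
        print(r, tile[r][::8])
```

Under these assumptions, row 0 comes out unchanged, each following row has its 16-byte chunks permuted by XOR with the row number, and the pattern repeats every 8 rows. Because the XOR only touches bits it does not read, applying the function twice gives back the original offset, so the same mapping describes both directions of the transfer.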

All the code and the walkthrough are in the repo!