Following in the tradition of worked kernel examples, Bert, of the PyTorch and Triton teams at Meta, writes up his experience developing a fast Ping-Pong kernel with TMA (fast loading on Hopper/H100) from scratch. As you might expect, there are some good insights from debugging and working through the problems.
You know what actually made it super obvious? Programming. I filled a shared memory buffer with consecutive integers, basically the smem equivalent of torch.arange(64*128).bfloat16().reshape(64, 128), and then TMA-transferred that to GMEM with 128B swizzling, cudaMemcpyed it back to the host, and printed it out. This actually made it crystal clear! I wrote the swizzle function correctly on my first try 😄.
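To get a feel for what such a printout might look like, here is a minimal sketch in plain Python. It is not taken from the post or its repo. It assumes the 128B swizzle mode behaves like the XOR pattern often written as Swizzle<3,4,3> (the 16-byte chunk index is XORed with the row index mod 8), that the buffer is 1024-byte aligned, and, for illustration only, that each row holds 64 bf16 values (128 bytes). The names `swizzle_128b` and `swizzled_row` are hypothetical.

```python
# Sketch of a 128-byte swizzle on byte offsets.
# Assumption: the XOR pattern commonly written as Swizzle<3,4,3>;
# this is an illustration, not the code from the post's repo.

ROW_BYTES = 128     # one swizzle row: 128 bytes = 64 bf16 values
CHUNK_BYTES = 16    # swizzling moves whole 16-byte chunks
BF16_BYTES = 2


def swizzle_128b(byte_offset: int) -> int:
    """XOR the 16-byte-chunk index (bits 4-6) with the row index mod 8 (bits 7-9)."""
    row_mod_8 = (byte_offset >> 7) & 0b111
    return byte_offset ^ (row_mod_8 << 4)


def swizzled_row(row: int, cols: int = ROW_BYTES // BF16_BYTES) -> list[int]:
    """For each physical slot in `row`, return the logical element stored there."""
    out = [0] * cols
    for col in range(cols):
        logical = row * cols + col
        # Only bits 4-6 change, so the element stays inside its own row.
        physical = swizzle_128b(logical * BF16_BYTES) // BF16_BYTES
        out[physical - row * cols] = logical
    return out


if __name__ == "__main__":
    step = CHUNK_BYTES // BF16_BYTES
    for row in range(8):
        # Show the first element of each 16-byte chunk to expose the permutation.
        print(row, swizzled_row(row)[::step])
```

Filling the buffer with consecutive integers is what makes the pattern readable: row 0 comes out unchanged, while the rows after it show their 16-byte chunks reordered. Whether this matches a given TMA transfer depends on the tensor map configuration, so treat it as a reading aid, not a reference.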
All the code and the walkthrough are in the repo!
