Monarch: PyTorch Single Controller
June 10, 2025
I’ve been excited for this to make it to OSS: the PyTorch team at Meta recently soft-launched Monarch on GitHub.
June 10, 2025
June 5, 2025
June 1, 2025
May 30, 2025
File this under the “gross oversimplifications” category. The basic approach to keeping GPUs busy is dividing the work into tiles: smaller sub-problems that together make up the larger result. For a GEMM, you might break the output matrix into 128×128 or 128×64 tiles and let each CUDA thread block (CTA) own one tile. The GPU has many streaming multiprocessors (an A100 has 108), and every SM picks up one CTA at a time. If you want to know how many SMs your own card has, you can call:
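A minimal Python sketch of that query and of the tile arithmetic described above. It is an illustration, not the post's original snippet: it assumes PyTorch is installed with CUDA support for the query, falls back to the A100's 108 SMs otherwise, and keeps the same simplification that each SM runs one CTA at a time. The helper names `query_sm_count` and `tile_schedule` are hypothetical.

```python
import math


def query_sm_count(default=108):
    """Return the SM count of GPU 0 via PyTorch, or `default` if unavailable.

    The default of 108 is the A100 figure quoted in the text.
    """
    try:
        import torch  # assumption: PyTorch is the tool used for the query

        if torch.cuda.is_available():
            return torch.cuda.get_device_properties(0).multi_processor_count
    except ImportError:
        pass
    return default


def tile_schedule(m, n, tile_m=128, tile_n=128, num_sms=108):
    """Count output tiles (one CTA each) and the waves needed to run them.

    Simplification from the text: every SM holds exactly one CTA at a time,
    so one wave processes at most `num_sms` tiles.
    """
    tiles = math.ceil(m / tile_m) * math.ceil(n / tile_n)
    waves = math.ceil(tiles / num_sms)
    last_wave = tiles - (waves - 1) * num_sms  # CTAs in the final, partial wave
    return tiles, waves, last_wave


if __name__ == "__main__":
    tiles, waves, last_wave = tile_schedule(4096, 4096, num_sms=108)
    print(f"{tiles} tiles, {waves} waves, {last_wave} CTAs in the last wave")
```

Under these assumptions, a 4096×4096 output with 128×128 tiles gives 1024 tiles and 10 waves on 108 SMs, with only 52 SMs busy in the last wave.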
May 28, 2025
Don’t blindly tie every piece of work to top-level metrics. Even if technically feasible, the cost is too high and the risk of spurious logic chains significant.
May 27, 2025
May 24, 2025
Fused Linear Cross-Entropy is a popular optimization that combines the final linear projection and the cross-entropy loss into a single operation. This fusion is very valuable for training large language models efficiently, as it can reduce memory usage significantly, particularly for larger vocabularies.
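To get a feel for why vocabulary size matters, here is a rough back-of-the-envelope sketch in Python. It assumes one common way of implementing the fusion, where tokens are processed in chunks so that only one chunk of logits is live at a time. The function names, the chunk size, and the example shapes are illustrative assumptions, not measurements from any particular implementation.

```python
def logits_bytes(n_tokens, vocab_size, bytes_per_element=4):
    """Memory needed to materialize a full (n_tokens x vocab_size) logits tensor."""
    return n_tokens * vocab_size * bytes_per_element


def chunked_logits_bytes(chunk_tokens, vocab_size, bytes_per_element=4):
    """Peak logits memory if only `chunk_tokens` rows exist at any one time.

    Assumption: the fused op computes the loss (and gradients) chunk by chunk
    and discards each chunk's logits before moving on.
    """
    return chunk_tokens * vocab_size * bytes_per_element


if __name__ == "__main__":
    n_tokens, vocab = 8192, 128256  # hypothetical batch*sequence and vocabulary
    full = logits_bytes(n_tokens, vocab)
    chunked = chunked_logits_bytes(1024, vocab)
    print(f"full logits:    {full / 2**30:.2f} GiB")
    print(f"chunked logits: {chunked / 2**30:.2f} GiB")
    print(f"reduction:      {full // chunked}x")
```

The logits footprint scales linearly with vocabulary size, which is why the saving grows as vocabularies get larger.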
May 23, 2025
Daniel Schmachtenberger has made the argument:
May 19, 2025