It’s always the learning rates
June 28, 2026
Pre-training any kind of good LLM is very, very expensive. Thankfully, we have scaling laws. Lilian Weng of Thinky writes:
June 19, 2026
Back in 2022 and 2023 there were two big branches of machine learning happening at Meta[1]. The LLM work that led to Llama was a clean, smooth stack of repeated Transformer modules; the recommendation systems graphs were, by contrast, terrifying. Luckily, the industry has remedied that state of affairs by making LLMs a lot more complicated.
[1] And many smaller ones, shout outs to all my Content Understanding and integrity peeps
June 12, 2026
When we started building LLMs, we mostly focused on them knowing things. They had information encoded in their weights, and they could spit it out when given sufficient prompts. But an agent doesn’t just need to know things; it needs to combine several kinds of knowledge.
June 5, 2026
The capabilities in a large language model emerge, mysteriously, from the training data. Everyone agrees that you start with a big pile of data, add some compute, and at the end you can vibe code. Opinions differ on what that pile of data should look like.
May 31, 2026
There has been a lot of drama[1] about distillation: how (closed) frontier models are being used by other labs to boost their own performance on particularly hard tasks.
[1] And/or marketing.
May 27, 2026
A thing you can do is take the most performance- and correctness-sensitive part of your stack and just ask a chatbot to write it for you. They will sometimes get it right!
May 25, 2026
SIMT offered a fantastic bargain. You write a straight-line program, the machine runs a lot of copies of it, and when one waits for memory the hardware swaps in others. You look with disdain on the less enlightened thread programmers dealing with deadlocks and concurrency etc. etc.
April 27, 2026
If you want to see what a very painful couple of months looks like for an ML research team, FAIR’s logbook of the OPT-175B pretraining from 2021 should top your list. The first few runs are basically:
April 8, 2026
Recently I had a conversation with an infrastructure team supporting an ML modeling group. The two orgs used to collaborate to ship experiments: the modeling team would come up with ideas, the infra team would augment their frameworks and build out tooling to make those ideas scalable. Together, they would ship an experiment every couple of weeks. Now the modeling team is largely making the framework changes and performance improvements themselves, thanks to coding agents, and are shipping a few experiments every single week. The infra team are still busy, but they are firefighting and debugging when the agents get stuck. The modeling team are much more productive, undeniably, and all the humans are busy, but the work for the infra team has ended up somewhat worse.
March 17, 2026
You may have noticed that FlashAttention 4 was supported in PyTorch really quickly. That required a bit of new infrastructure: torch.native by Simon Layton. Prior versions of FlashAttention were written in Cutlass/C++, but for FA4 the team implemented the kernel in CuteDSL.