LLMs are complicated now

Back in 2022 and 2023 there were two big branches of machine learning happening at Meta [1]. The LLM work that led to Llama was a clean, smooth stack of repeated Transformer modules; the recommendation system graphs were, by contrast, terrifying. Luckily, the industry has remedied that state of affairs by making LLMs a lot more complicated.

  1. And many smaller ones; shout-outs to all my Content Understanding and integrity peeps.

FactWorld

When we started building LLMs, we mostly focused on them knowing things. They had information encoded in their weights, and they could spit it out when given sufficient prompts. But an agent doesn’t just need to know things; it needs to combine several kinds of knowledge.

Somehow, more on distillation

The capabilities in a large language model emerge, mysteriously, from the training data. Everyone agrees that you start with a big pile of data, add some compute, and at the end you can vibe code. Opinions differ on what that pile of data should look like.

The elusive order of things

SIMT offered a fantastic bargain. You write a straight-line program, the machine runs a lot of copies of it, and when one waits for memory the hardware swaps in others. You look with disdain on the less enlightened thread programmers dealing with deadlocks and concurrency etc. etc.

Unbundling Work

Recently I had a conversation with an infrastructure team supporting an ML modeling group. The two orgs used to collaborate to ship experiments: the modeling team would come up with ideas, the infra team would augment their frameworks and build out tooling to make those ideas scalable. Together, they would ship an experiment every couple of weeks. Now the modeling team are largely making the framework changes and performance improvements themselves, thanks to coding agents, and are shipping a few experiments every single week. The infra team are still busy, but they are firefighting and debugging when the agents get stuck. The modeling team are much more productive, undeniably, and all the humans are busy, but the work for the infra team has ended up somewhat worse.

Native DSL Ops in PyTorch

You may have noticed that FlashAttention 4 was supported in PyTorch really quickly. That required a bit of new infrastructure: torch.native by Simon Layton. Prior versions of FlashAttention were written in CUTLASS/C++, but for FA4 the team implemented the kernel in CuTe DSL.
