Performance of the tail-call interpreter in Python 3.14

March 10, 2025

Performance of the Python 3.14 tail-call interpreter – Made of Bugs

Great example of the benefits (and complexities!) of open source today. Nelson Elhage at Anthropic investigated the recent tail-call interpreter improvements in Python 3.14 after growing suspicious of the claimed 10-15% performance win. It turned out that the baseline was artificially bad due to a regression in LLVM 19. In his words:

Unfortunately, as I will document in this post, these impressive performance gains turned out to be primarily due to inadvertently working around a regression in LLVM 19. When benchmarked against a better baseline (such as GCC, clang-18, or LLVM 19 with certain tuning flags), the performance gain drops to 1-5% or so depending on the exact setup.

That’s still a very impressive speedup for something as widely used and heavily optimized as the Python interpreter. Ken Jin, the original change author, wrote a nice apology post, but this is a very tricky situation! Nelson explains the root cause:

In order to avoid catastrophic slowdowns (or memory usage) in certain cases, LLVM 19 implemented some limits on the tail-duplication pass, causing it to bail out if duplication would blow up the size of the IR past certain limits. Unfortunately, on CPython those limits resulted in Clang leaving all of the dispatch jumps merged, and entirely undoing the whole purpose of the computed goto-based implementation!
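To see why merging the dispatch jumps matters, here is a toy model of one commonly cited rationale for computed-goto dispatch: giving each opcode handler its own indirect jump, so the branch predictor has more context per jump site. This is only a sketch under loud assumptions. The "last-target" predictor, the opcode names, and the four-instruction loop are all hypothetical, and real CPUs use far more sophisticated predictors, so the numbers say nothing about the size of the effect on actual hardware.

```python
def mispredictions(opcodes, shared_dispatch):
    """Count mispredicted dispatch jumps under a toy last-target predictor.

    Each indirect jump site remembers only the last target it jumped to
    and predicts that target again. With shared_dispatch=True every handler
    funnels through a single jump site (the merged case); otherwise each
    opcode handler ends in its own jump site (the replicated case).
    """
    last_target = {}  # jump site -> most recent target opcode
    misses = 0
    for prev_op, next_op in zip(opcodes, opcodes[1:]):
        site = "shared" if shared_dispatch else prev_op
        if last_target.get(site) != next_op:
            misses += 1
        last_target[site] = next_op
    return misses


# Hypothetical bytecode loop, not taken from CPython.
program = ["LOAD", "ADD", "LOAD", "STORE"] * 1000

merged = mispredictions(program, shared_dispatch=True)
replicated = mispredictions(program, shared_dispatch=False)
print(f"merged dispatch:     {merged} misses")
print(f"replicated dispatch: {replicated} misses")
```

In this model the merged jump site never predicts correctly, because consecutive targets always differ, while the replicated version learns that ADD and STORE are always followed by LOAD.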

Baselines are a persistent issue, and the post has some good things to say there. A level of skepticism is good: in this case a change meant to fix a problem in one area ended up causing a performance hit in another, which is pretty common. And while this was a genuine and very hard-to-spot mistake, it’s much more common to (somewhat) intentionally choose a flattering baseline. I recently had discussions at work about something that had delivered a double-digit improvement, but only when compared against a poor baseline. When measured against a more meaningful alternative, the gain was much more modest. Nelson makes the point succinctly:

I work in machine learning at Anthropic these days, and we see this all the time in ML papers. When a paper comes out claiming some algorithmic improvement or other advance, I’ve noticed that the first detail our researchers ask is often not “What did they do?” but “What baseline did they compare against?” It’s easy to get impressive-looking results if you’re comparing against a poorly-tuned baseline, and that observation turns out to explain a surprising fraction of supposed improvements.
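The arithmetic behind that point is simple enough to sketch. The timings below are hypothetical, chosen only so that the results land inside the 10-15% and 1-5% ranges quoted above. They are not measurements from the post, and `speedup_percent` is an illustrative helper rather than anything from CPython's benchmark tooling.

```python
def speedup_percent(baseline_seconds, candidate_seconds):
    """Percent reduction in run time of the candidate relative to a baseline."""
    return 100.0 * (baseline_seconds - candidate_seconds) / baseline_seconds


# Hypothetical run times in seconds for the same benchmark.
baselines = {
    "regressed baseline": 1.15,
    "well-tuned baseline": 1.03,
}
tail_call_seconds = 1.00

for name, seconds in baselines.items():
    gain = speedup_percent(seconds, tail_call_seconds)
    print(f"vs {name}: {gain:.1f}% less time")
```

The candidate is identical in both rows. Only the denominator changes, and with it the headline number.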

This happens at large scale too: Google, on a recent earnings call, touted that 25% of their code was AI generated. This was apparently true, but in part that’s replacing the traditional autocomplete with a model copilot. The real question there is what percentage would have been “machine generated” anyway, and how much of the delta is attributable to LLMs: very likely a much smaller number, especially for a company with very sophisticated developer infrastructure.

The investigation is also a good counter to some breathless takes around vibe coding: this stuff is hard, and it will take a while before tools can automatically catch, root-cause, and resolve something like this:

If you’d asked me, a month ago, to estimate the likelihood that an LLVM release caused a 10% performance regression in CPython and that no one noticed for five months, I’d have thought that a pretty unlikely state of affairs! Those are both widely-used projects, both of which care a fair bit about performance, and “surely” someone would have tested and noticed.