AbsenceBench

June 21, 2025

https://arxiv.org/abs/2506.11440

Simon Willison has a good summary:

Long context models have been getting increasingly good at passing “Needle in a Haystack” tests recently, but what about a problem in the opposite direction?

The answers are surprisingly domain-specific; some models do great on numeric sequences but most are pretty bad at code!
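To make the task concrete, here is a minimal sketch of how an absence-style test case could be built and scored. It assumes the basic setup of showing a model the original document alongside a copy with some lines removed, then asking which lines are gone. The function names, the omission rate, and the F1 scoring are illustrative assumptions, not the paper's actual harness.

```python
import random
from collections import Counter


def make_absence_task(lines, omit_fraction=0.2, seed=0):
    """Drop a random subset of lines (hypothetical helper).

    Returns (modified_lines, omitted_lines), both in original order.
    """
    rng = random.Random(seed)
    n_omit = max(1, round(len(lines) * omit_fraction))
    omitted_idx = set(rng.sample(range(len(lines)), n_omit))
    modified = [l for i, l in enumerate(lines) if i not in omitted_idx]
    omitted = [l for i, l in enumerate(lines) if i in omitted_idx]
    return modified, omitted


def f1_score(predicted, omitted):
    """Set-based F1 between predicted and truly omitted lines (assumed metric)."""
    pred, gold = set(predicted), set(omitted)
    tp = len(pred & gold)
    if tp == 0:
        return 0.0
    precision = tp / len(pred)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)


def diff_baseline(original, modified):
    """Non-model baseline: a multiset diff recovers the omissions exactly."""
    remaining = Counter(modified)
    missing = []
    for line in original:
        if remaining[line] > 0:
            remaining[line] -= 1
        else:
            missing.append(line)
    return missing


lines = [f"line {i}" for i in range(10)]
modified, omitted = make_absence_task(lines)
predicted = diff_baseline(lines, modified)  # stand-in for a model's answer
print(f1_score(predicted, omitted))
```

The diff baseline is there for contrast: a few lines of ordinary code solve the task perfectly, which is part of what makes it an interesting probe of how models attend to what isn't in the context.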

The authors posit that attention is just a worse mechanism for seeing what’s missing than for seeing what’s there. For me this rhymes with the experience of folks doing agentic coding assistant work: it’s beneficial to clear the context window more often than you think, as the models strongly prefer to use what is already in there.

This feels like a learned or tuned behavior, a flavor of “the model does the eval”. Models will probably get better at this problem now that it’s legible, but is there a tradeoff that has to be made?

Pretraining is somewhat saturating, but we have oodles of post-training (which includes context extension), the whole meta-RL process of researchers trying different data mixes and algorithm/architecture tweaks, and inference-time search options.

If OpenAI had Anthropic’s data and evals, would they have as good an agentic coding model? And vice versa, would Opus be as good at deep research as o3? I honestly don’t know: in the end compute will always be finite and we have to allocate it with some end in mind. It feels very plausible there is no globally optimal scaling law for how you prioritize different model capabilities. But the models will probably do this eval.