Scientific discovery and AI

· May 19, 2025

I got fooled by AI-for-science hype—here’s what it taught me

A LinkedIn-esque headline, but a good guest post by physicist Nick McGreivy on Tim Lee’s Understanding AI newsletter.

Later research found that scientists who use AI are more likely to publish top-cited papers and receive on average three times as many citations. With such strong incentives to use AI, it isn’t surprising that so many scientists are doing so.

So even when AI achieves genuinely impressive results in science, that doesn’t mean that AI has done something useful for science. More often, it reflects only the potential of AI to be useful down the road.

The problem Nick describes, where he found that many of the ML techniques for PDE solving (the area he was looking into) didn’t end up improving on non-ML approaches, feels very common. AI research likes to hill-climb metrics. It’s often the lack of progress on a certain benchmark that motivates new techniques, like the growth of test-time compute over the past year to drive math and logic performance higher.

It brings to mind Zhengdong Wang’s fantastic year-in-review letter from last year.

The model does the eval is the backbone of how one should access and marshall their intuitions into a coherent view on AI progress.

The first awesome conclusion of the model does the eval is that we will achieve every evaluation we can state. Recall that evaluations must be legible, fast, and either a good approximation of a wanted capability or useful itself. The plummeting cost of compute has made all evaluations faster.

[…]

Add human intelligence to direct the cheaper compute to get more legible evaluations. Two years ago, Demis Hassabis enumerated three properties of problems suitable for AI: a massive combinatorial search space, a clear objective function to optimize against, and lots of data or an efficient simulator.

We tend to succeed where we have the evals and we have the data. Having the evals also starts to create a common lingua franca to discuss relative performance, not that it eliminates the baseline hacking Nick discusses.

The evals are often tied to having a good-quality core dataset that can be used for both training and evaluation. Even in areas where we have had scientific progress, mainly AlphaFold and its descendants, we had a major leg up, as Derek Lowe often writes, from the existence of the PDB, an extensive database of high-quality protein structures created by people.

When we look back at major breakthroughs, we often credit that aspect: Dr Fei-Fei Li is one of the pioneers of deep learning thanks in part to the creation of ImageNet. I hope that one takeaway for scientists reading Nick’s note is that the creation of quality benchmarks and datasets can drive more progress than the application of (or innovation on) new ML techniques themselves!