https://trent.me/articles/pytorch-and-python-free-threading/
Trent Nelson has written an extremely detailed breakdown of his experiments running GPT-2 inference with PyTorch on the free-threaded (GIL-free) builds of Python 3.13 and 3.14.
He implements parallel generation using multiple threads (on one GPU and later across multiple devices) and parallel model loading, then digs into some of the challenges with torch.compile (which doesn’t work great with free threading yet!).
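To give a rough feel for the thread-per-request pattern, here is a minimal sketch of my own, not code from Trent's article. It assumes a hypothetical `fake_generate` function as a CPU-bound stand-in for a real model's generation call, so it runs without PyTorch or a GPU. The helper names `gil_enabled` and `generate_parallel` are also mine. `sys._is_gil_enabled()` is only available on Python 3.13 and later, so the sketch falls back to assuming the GIL is present on older versions.

```python
import sys
from concurrent.futures import ThreadPoolExecutor


def gil_enabled() -> bool:
    # sys._is_gil_enabled() was added in Python 3.13; on older versions
    # the attribute is missing and the GIL is always enabled.
    check = getattr(sys, "_is_gil_enabled", None)
    return True if check is None else check()


def fake_generate(prompt: str, steps: int = 1000) -> str:
    # Hypothetical stand-in for a model's generation call:
    # deterministic, CPU-bound, pure-Python work.
    state = sum(ord(ch) for ch in prompt)
    for _ in range(steps):
        state = (state * 1103515245 + 12345) % 2**31
    return f"{prompt} -> {state % 1000}"


def generate_parallel(prompts, workers=4):
    # One task per prompt; pool.map returns results in input order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fake_generate, prompts))


if __name__ == "__main__":
    print("GIL enabled:", gil_enabled())
    for line in generate_parallel(["hello", "world", "free threading"]):
        print(line)
```

On a regular build the threads take turns holding the GIL, so pure-Python work like this gains little from extra workers. On a free-threaded build the same code can spread across multiple cores.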
Hopefully this encourages more folks to experiment with free-threaded Python, or perhaps port their existing Python packages to play nicely when installed in a free-threaded Python environment. I personally can’t wait until free-threaded Python is the default! Although that’s probably at least five or so years out at this point.
Free-threaded Python really changes the performance trade-offs around Python, and I expect it to be the default for ML work a lot sooner than that!
