TIL: Numerical stability in TorchScript vs Export

· November 5, 2024

til

A very good passing comment from Nikita at work. When folks train in eager mode, they can generally use the full set of aten ops, each with its own implementations in the different backends.

Taking the trained model and exporting it with TorchScript largely guaranteed the same results, assuming it TorchScripted cleanly, because it’s just using libtorch underneath. The downside is shipping a lot of stuff you probably don’t need, which is not great for LiteInterpreter on mobile in particular, plus of course all the TorchScript weirdness.

torch.export offers a much cleaner flow but, on the flip side, does a lot more processing of the model. The aten ops are decomposed into a simpler set of ~200 ops for the export IR, which might then be further processed for specific hardware. While you have a lot of control over this process, it means there are a lot of places to introduce subtle numerical differences between training and whatever you are doing next, whether that is inference or further training.
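To make the "subtle differences" concrete, here is a minimal sketch of how rewriting one fused op as a sequence of simpler ops can change a floating-point result. It uses plain Python `math` rather than PyTorch, and `log1p` is only a stand-in chosen for illustration: it is not a claim about how torch.export decomposes any particular op. The helper names are hypothetical.

```python
import math


def fused_log1p(x: float) -> float:
    # One library routine, written to stay accurate when x is tiny.
    return math.log1p(x)


def decomposed_log1p(x: float) -> float:
    # The same math as two simpler ops: an add, then a log.
    # The add rounds 1 + x to the nearest double before log sees it,
    # so part of x is already lost by the time the log runs.
    return math.log(1.0 + x)


def relative_diff(reference: float, candidate: float) -> float:
    # Relative error of candidate against a nonzero reference value.
    return abs(reference - candidate) / abs(reference)


x = 1e-10
fused = fused_log1p(x)
decomposed = decomposed_log1p(x)
rel = relative_diff(fused, decomposed)

print(f"fused={fused!r}")
print(f"decomposed={decomposed!r}")
print(f"relative difference={rel:.3e}")
```

Both versions are "correct" implementations of the same formula, yet they disagree well above machine epsilon. A parity check between the eager model and the exported one would catch this kind of drift, provided the tolerance is chosen deliberately rather than left at a loose default.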

The important thing here is not to treat the two technologies as straight substitutes, but as different paths to the same goal that require different processes around them.