Inductor notes

January 16, 2024

(Updated: Jan 16, 2025)

note-to-self

Inductor is PyTorch’s compiler backend, designed to generate high-performance code for arbitrary models. It works in a few phases:

AOTAutograd: Capturing the Forward and Backward Pass

AOTAutograd traces the ops executed in the forward and backward passes into graphs. Each graph is wrapped in a torch.fx.GraphModule container, and shapes and sizes are made symbolic using SymPy.
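To see the captured graphs, here is a minimal sketch that hooks two recording "compilers" into AOTAutograd. It assumes a PyTorch 2.x install where aot_function is importable from torch._functorch.aot_autograd. The recorder helper, the captured dict, and the function f are made up for illustration, and this run uses static shapes rather than symbolic ones.

```python
import torch
from torch._functorch.aot_autograd import aot_function

captured = {}  # hypothetical store, filled in by the recording compilers below


def recorder(name):
    """Build a stand-in compiler that keeps the traced graph and runs it unchanged."""
    def compiler(gm: torch.fx.GraphModule, example_inputs):
        captured[name] = gm
        return gm  # a GraphModule is callable, so it can serve as the "compiled" function
    return compiler


def f(x, y):
    return (x * y).sin().sum()


aot_f = aot_function(
    f,
    fw_compiler=recorder("forward"),
    bw_compiler=recorder("backward"),
)

x = torch.randn(4, requires_grad=True)
y = torch.randn(4, requires_grad=True)
aot_f(x, y).backward()  # the backward graph may only be handed over at this point

for name, gm in captured.items():
    print(name)
    print(gm.code)  # generated Python for the traced graph
```

A real backend would return an optimized callable from each compiler instead of the graph itself.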

Operators are decomposed to a standard IR: either ATen IR, or a simpler set of ops (Prims IR) for backends like Inductor that can do their own fusions. Backends can also specify their own decompositions, which are passed in when hooking up to AOTAutograd:

import torch
import torch._decomp

# Collect the registered decompositions for the listed ATen ops.
prims_decomp = torch._decomp.get_decompositions([
    torch.ops.aten.add,
    torch.ops.aten.expand.default,
])
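A rough way to see what a decomposition table changes is to trace the same function with and without it. This sketch uses make_fx (the tracer from torch.fx.experimental.proxy_tensor) rather than a full AOTAutograd setup, which is a simplification on my part. The function f and the tensor shapes are made up, and the exact ops printed depend on the PyTorch version.

```python
import torch
from torch._decomp import get_decompositions
from torch.fx.experimental.proxy_tensor import make_fx

prims_decomp = get_decompositions([
    torch.ops.aten.add,
    torch.ops.aten.expand.default,
])


def f(x, y):
    return x + y  # y is broadcast across the rows of x


x, y = torch.randn(3, 4), torch.randn(4)

plain = make_fx(f)(x, y)                                         # graph of ATen ops
decomposed = make_fx(f, decomposition_table=prims_decomp)(x, y)  # listed ops decomposed

print(plain.code)
print(decomposed.code)
```

Both results are GraphModules that can be called like the original function, so comparing their outputs is a cheap sanity check on a decomposition.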

Inductor Lowering

This is the start of Inductor itself as a backend. It begins by converting the ATen IR into a Python-based define-by-run IR. Define-by-run means the IR is interpreted by executing it, so the same IR ops can do different things in different passes.
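The define-by-run idea is easier to see in a toy. The sketch below is plain Python and not Inductor's actual classes: EvalOps, CodegenOps, and inner_fn are hypothetical names. The same IR body is run against two handlers, one that computes values and one that emits source text.

```python
class EvalOps:
    """Handler that actually computes values from buffers."""

    def __init__(self, buffers):
        self.buffers = buffers

    def load(self, name, index):
        return self.buffers[name][index]

    def mul(self, a, b):
        return a * b

    def add(self, a, b):
        return a + b


class CodegenOps:
    """Handler that returns source-code strings instead of values."""

    def load(self, name, index):
        return f"{name}[{index}]"

    def mul(self, a, b):
        return f"({a} * {b})"

    def add(self, a, b):
        return f"({a} + {b})"


def inner_fn(ops, index):
    # The IR body is ordinary Python; running it against a handler defines its meaning.
    x = ops.load("x", index)
    return ops.add(ops.mul(x, ops.load("y", index)), x)


buffers = {"x": [1.0, 2.0, 3.0], "y": [10.0, 20.0, 30.0]}
values = [inner_fn(EvalOps(buffers), i) for i in range(3)]
source = inner_fn(CodegenOps(), "i")

print(values)
print(source)
```

Adding another pass (say, counting memory reads) would just mean writing another handler, with no change to inner_fn.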

In this process Inductor:

Eliminates views (operations that do not change the underlying data but modify how tensors are accessed).
Removes broadcasting overhead by explicitly adjusting tensor shapes.
Simplifies indexing patterns to enable more efficient execution.
Does classic compiler things like dead code elimination.
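One way to picture view elimination and broadcasting removal: a view or a broadcast becomes a change to the index expression rather than an op that moves data. This is a toy in plain Python under that assumption. All helper names are hypothetical and none of it is Inductor code.

```python
def row_major_loader(flat, cols):
    """Return a loader that maps a 2-D index onto flat row-major storage."""
    return lambda i, j: flat[i * cols + j]


def transpose(load):
    # A view: nothing is copied, only the index order changes.
    return lambda i, j: load(j, i)


def broadcast_over_rows(vec):
    # Broadcasting a vector over rows: the row index is simply ignored.
    return lambda i, j: vec[j]


def add(load_a, load_b):
    return lambda i, j: load_a(i, j) + load_b(i, j)


storage = [0, 1, 2, 3, 4, 5]  # a 2x3 tensor stored flat
a = row_major_loader(storage, cols=3)

# transpose(a) has logical shape 3x2; the bias [10, 20] is broadcast over its rows.
out = add(transpose(a), broadcast_over_rows([10, 20]))
result = [[out(i, j) for j in range(2)] for i in range(3)]

print(result)
```

No transposed or expanded copy is ever materialized; the only buffer read is the original storage.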

Scheduling

The scheduler plans the execution of operations to optimize performance. It does vertical fusion (merging producer and consumer operations along the graph) and horizontal fusion (merging independent operations across different tensors), sets up tiling, and handles reductions such as sums and averages.
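A toy picture of what fusion buys, in plain Python with lists standing in for tensors and loops standing in for kernels. The function names are made up and this only illustrates the idea, not how the scheduler decides what to fuse.

```python
import math


def unfused(x):
    tmp = [v * 2.0 for v in x]         # "kernel" 1 writes an intermediate buffer
    return [math.sin(t) for t in tmp]  # "kernel" 2 reads it back


def vertically_fused(x):
    # Producer and consumer share one loop, so the intermediate never hits memory.
    return [math.sin(v * 2.0) for v in x]


def horizontally_fused(x):
    # Two independent reductions over the same input share a single pass.
    total, peak = 0.0, float("-inf")
    for v in x:
        total += v
        peak = max(peak, v)
    return total, peak


data = [0.1, 0.5, 0.9]
print(unfused(data) == vertically_fused(data))
print(horizontally_fused([1.0, 5.0, 2.0]))
```

The fused versions compute the same results while touching the input once and skipping the temporary buffer.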

This phase also does autotuning (profiling multiple implementations of an op to select the best one), memory planning to avoid bottlenecks, and so on.
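Autotuning boils down to "time the candidates, keep the fastest". A minimal sketch of that loop in plain Python follows. The candidate functions and the benchmark helper are hypothetical, and real autotuning benchmarks generated kernels and configurations rather than Python functions.

```python
import time


def benchmark(fn, arg, repeats=5):
    """Return the best wall-clock time over a few runs of fn(arg)."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn(arg)
        best = min(best, time.perf_counter() - start)
    return best


def sum_loop(xs):
    total = 0
    for v in xs:
        total += v
    return total


def sum_builtin(xs):
    return sum(xs)


candidates = {"loop": sum_loop, "builtin": sum_builtin}
data = list(range(10_000))

timings = {name: benchmark(fn, data) for name, fn in candidates.items()}
winner = min(timings, key=timings.get)

print(timings)
print("selected:", winner)
```

Which candidate wins depends on the machine, which is the whole reason to measure instead of guessing.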

Code Generation

Finally, Inductor generates the executable code for the target hardware. It can choose among multiple backends, including Triton, OpenMP, CUTLASS, ROCm, XPU (for Intel GPUs), and others.

It also generates kernel wrappers to handle memory allocation and orchestration.
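As a toy model of the kernel-plus-wrapper split, the sketch below generates source for one "kernel" and one wrapper, then executes it. Everything here (kernel0, call, buf0, the source strings) is invented for illustration. Real generated code targets Triton or C++ and looks quite different.

```python
# Source for a fused pointwise "kernel", as a code generator might emit it.
KERNEL_SRC = '''
def kernel0(x, y, out, n):
    for i in range(n):
        out[i] = x[i] * y[i] + x[i]
'''

# Source for the wrapper: it owns allocation and the order of kernel launches.
WRAPPER_SRC = '''
def call(x, y):
    n = len(x)
    buf0 = [0.0] * n        # output buffer allocated by the wrapper
    kernel0(x, y, buf0, n)  # launch the kernel on the allocated buffer
    return buf0
'''

namespace = {}
exec(KERNEL_SRC + WRAPPER_SRC, namespace)  # "compile" the generated module

result = namespace["call"]([1.0, 2.0], [10.0, 20.0])
print(result)
```

The kernel only does math on buffers it is handed; deciding which buffers exist and when each kernel runs is the wrapper's job.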

This code lives under torch/_inductor/codegen.