https://arxiv.org/abs/2504.02263
LLMs have most of their parameters in the FFN parts of the transformer layers: 50+bn params of the Llama 3 70b model, for example. The compute and memory requirements are a bit different between the FFN and attention parts of the model. Attention needs a separate KV cache for each request, so that memory traffic can't be shared across a batch and attention tends to be memory bound during decode. The dense FFNs reuse the same weights for every token in the batch, so they tend to be compute bound once the batch is large enough.
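To sanity check the "50+bn" figure, here is a rough parameter count. It is a sketch, not something from the paper: the dimensions are the published Llama 3 70B config values as I understand them, and the small norm weights are ignored.

```python
# Rough parameter count for Llama 3 70B.
# Assumption: published config values; norm weights are ignored as negligible.
HIDDEN = 8192          # model (residual stream) width
INTERMEDIATE = 28672   # FFN inner width
LAYERS = 80
N_HEADS = 64
N_KV_HEADS = 8         # grouped-query attention
VOCAB = 128256
HEAD_DIM = HIDDEN // N_HEADS  # 128


def ffn_params_per_layer() -> int:
    # SwiGLU FFN has three weight matrices: gate, up and down projections.
    return 3 * HIDDEN * INTERMEDIATE


def attention_params_per_layer() -> int:
    q_and_o = 2 * HIDDEN * HIDDEN
    # K and V are smaller because only N_KV_HEADS heads are stored.
    k_and_v = 2 * HIDDEN * (N_KV_HEADS * HEAD_DIM)
    return q_and_o + k_and_v


ffn_total = LAYERS * ffn_params_per_layer()
attn_total = LAYERS * attention_params_per_layer()
embed_total = 2 * VOCAB * HIDDEN  # input embedding plus untied output head
total = ffn_total + attn_total + embed_total

print(f"FFN:        {ffn_total / 1e9:.1f}bn")
print(f"Attention:  {attn_total / 1e9:.1f}bn")
print(f"Embeddings: {embed_total / 1e9:.1f}bn")
print(f"FFN share:  {ffn_total / total:.0%}")
```

Under these assumptions the FFN weights come to roughly 56bn of about 70.6bn parameters, so around four fifths of the model.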
Because of this it’s pretty common to split up tasks at inference time. The initial prefill stage (processing the initial prompt) populates the KV cache for the following autoregressive decoding. Decode can then be batched more aggressively to get better utilization. vLLM really helped popularize this idea!
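The reason batching helps can be put in terms of arithmetic intensity: FLOPs done per byte moved from memory. The sketch below is a simplified roofline-style estimate for a single FFN matmul. The hardware numbers are made-up round figures for illustration, not a specific GPU, and the function name is my own.

```python
def matmul_intensity(batch_tokens: int, d_in: int, d_out: int,
                     bytes_per_value: int = 2) -> float:
    """FLOPs per byte for a (batch_tokens x d_in) @ (d_in x d_out) matmul.

    Simplified model: weights are read once, activations are read and
    written once, everything is stored in 2-byte values by default.
    """
    flops = 2 * batch_tokens * d_in * d_out
    weight_bytes = bytes_per_value * d_in * d_out
    activation_bytes = bytes_per_value * batch_tokens * (d_in + d_out)
    return flops / (weight_bytes + activation_bytes)


# Hypothetical accelerator (illustrative round numbers only).
PEAK_FLOPS = 1.0e15     # FLOP/s
MEM_BANDWIDTH = 3.0e12  # bytes/s
RIDGE = PEAK_FLOPS / MEM_BANDWIDTH  # intensity needed to be compute bound

for batch in (1, 16, 64, 256, 1024):
    intensity = matmul_intensity(batch, d_in=8192, d_out=28672)
    bound = "compute" if intensity >= RIDGE else "memory"
    print(f"batch={batch:5d}  intensity={intensity:7.1f} FLOPs/byte  -> {bound} bound")
```

At a batch of one token the matmul does about one FLOP per byte of weights loaded, so it is badly memory bound. The intensity grows almost linearly with batch size, which is why piling more decode requests into each step pays off.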
ByteDance extend this idea to mixture-of-experts (MoE) models. In MoEs the compute intensity of the FFNs is limited by the need to load different experts, with only a fraction of the tokens in a batch going through any given expert. They extend the disaggregation idea to go from M “attention” GPUs to N (fewer!) expert GPUs, so each expert call sees a larger batch size. This gets better utilization on the matmuls and lowers the overall cost of serving. The natural structure of transformer layers, alternating attention and FFN, lends itself well to a ping-pong pipelining approach that lets them hide the comms overhead.
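A back-of-the-envelope way to see the batch size effect, assuming tokens are routed uniformly across experts (real routing is usually less even). The function names and the example numbers are hypothetical and not taken from the paper.

```python
import math


def tokens_per_expert(batch_per_replica: int, attention_replicas: int,
                      num_experts: int, top_k: int) -> float:
    """Average tokens each expert sees per step, assuming uniform routing."""
    # Every token is sent to top_k experts, so routed copies = tokens * top_k.
    routed = batch_per_replica * attention_replicas * top_k
    return routed / num_experts


def replicas_for_target(target_tokens: int, batch_per_replica: int,
                        num_experts: int, top_k: int) -> int:
    """Attention replicas needed so each expert averages target_tokens per step."""
    return math.ceil(target_tokens * num_experts / (batch_per_replica * top_k))


# Hypothetical setup: 8 experts, top-2 routing, 64 decode tokens per attention replica.
alone = tokens_per_expert(64, attention_replicas=1, num_experts=8, top_k=2)
pooled = tokens_per_expert(64, attention_replicas=8, num_experts=8, top_k=2)
print(f"one attention replica:    {alone:.0f} tokens per expert")
print(f"eight attention replicas: {pooled:.0f} tokens per expert")
print("replicas for 256 tokens per expert:",
      replicas_for_target(256, batch_per_replica=64, num_experts=8, top_k=2))
```

With a single attention replica each expert only sees a small slice of the batch. Pooling the tokens from several attention replicas is what pushes the per-expert matmuls back toward the compute-bound regime.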
We present MegaScale-Infer, an efficient and cost-effective system designed for large-scale MoE serving. MegaScale-Infer disaggregates the attention and expert modules, assigning them to separate GPUs—a strategy we term disaggregated expert parallelism. Our approach offers two major benefits. First, it enables independent scaling of each module with customized model parallelism strategies. Specifically, attention modules are replicated using data parallelism, while FFN modules are scaled with expert parallelism. By consolidating requests from multiple attention replicas, the GPU utilization of each expert increases significantly as the batch size per attention replica grows. Second, it enables the deployment of attention and FFN modules on heterogeneous GPUs to fully leverage their different capabilities and achieve lower costs. For example, attention modules can be deployed on GPUs with more cost-effective memory capacity and bandwidth, while FFN modules can utilize GPUs with more affordable compute capability. As shown in Figure 1(c), FFN can easily become compute-intensive in MegaScale-Infer, while attention achieves higher GPU utilization per cost under heterogeneous deployment.
