Single GPU LLM Prefill-Decode Disaggregation
About This Architecture
Single GPU LLM prefill-decode disaggregation partitions one physical GPU into two vLLM instances: 80% of the streaming multiprocessors (SMs) for prefill (prompt processing and KV cache generation) and 20% for decode (token streaming and output generation). Requests pass through AWS WAF, API Gateway, and an ALB to a proxy layer that distributes traffic to the prefill and decode partitions, which share model weights in memory. The KV cache moves between partitions via NIXL/UCX cross-process communication over a dedicated buffer layer, while libvgpu and the SM Zone Controller manage SM allocation. CloudWatch and X-Ray provide end-to-end latency metrics and tracing across the disaggregated pipeline. The architecture maximizes single-GPU throughput by eliminating compute that would otherwise sit idle during latency-bound decode phases, which matters for cost-sensitive inference workloads. Fork and customize this diagram on Diagrams.so to model your own GPU partitioning strategy or integrate alternative transport mechanisms.
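The request flow described above (proxy, then prefill, then KV cache handoff, then decode) can be sketched in a few lines of Python. This is a toy, in-process illustration only: `KVBuffer`, `prefill`, `decode`, and `handle_request` are hypothetical names, the dictionary stands in for the NIXL/UCX buffer layer, and the arithmetic in `decode` is a placeholder for a real model step. It does not use the vLLM, NIXL, or UCX APIs.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class KVBuffer:
    """Hypothetical stand-in for the dedicated buffer layer.

    In the architecture above, this handoff would go over NIXL/UCX
    between two processes; here it is just a dictionary.
    """
    slots: Dict[str, List[int]] = field(default_factory=dict)

    def put(self, request_id: str, kv: List[int]) -> None:
        self.slots[request_id] = kv

    def take(self, request_id: str) -> List[int]:
        # Remove the entry so the buffer slot is freed after handoff.
        return self.slots.pop(request_id)


def prefill(request_id: str, prompt_tokens: List[int], buffer: KVBuffer) -> None:
    """Prefill partition: process the whole prompt and publish its KV cache."""
    toy_kv = list(prompt_tokens)  # assumption: one toy cache entry per token
    buffer.put(request_id, toy_kv)


def decode(request_id: str, buffer: KVBuffer, max_new_tokens: int) -> List[int]:
    """Decode partition: pull the KV cache and emit tokens one at a time."""
    kv = buffer.take(request_id)
    output: List[int] = []
    for _ in range(max_new_tokens):
        next_token = (sum(kv) + len(kv)) % 1000  # placeholder for a model step
        output.append(next_token)
        kv.append(next_token)  # the cache grows by one entry per new token
    return output


def handle_request(request_id: str, prompt_tokens: List[int],
                   buffer: KVBuffer, max_new_tokens: int = 3) -> List[int]:
    """Proxy layer: run prefill first, then hand the request to decode."""
    prefill(request_id, prompt_tokens, buffer)
    return decode(request_id, buffer, max_new_tokens)
```

The point of the sketch is the ordering and the ownership of the cache: prefill writes once, decode takes the entry and extends it, and the proxy only sequences the two stages.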
People also ask
How do you maximize single GPU throughput for LLM inference by disaggregating prefill and decode workloads?
This diagram shows single-GPU prefill-decode disaggregation: one physical GPU is partitioned into 80% of its SMs for vLLM prefill (prompt processing, KV cache generation) and 20% for vLLM decode (token streaming). libvgpu and the SM Zone Controller handle the partitioning, NIXL/UCX carries the KV cache between partitions, and AWS WAF, API Gateway, and an ALB route requests. This eliminates idle GPU compute during latency-bound decode phases.
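As a rough illustration of the 80/20 split, the SM counts for each partition can be computed from the GPU's total SM count. The helper below is a hypothetical sketch, not part of libvgpu or the SM Zone Controller, whose configuration interfaces are not described here. The 108-SM figure in the usage note is the SM count of an NVIDIA A100 and is only an example.

```python
from typing import Tuple


def split_sms(total_sms: int, prefill_percent: int = 80) -> Tuple[int, int]:
    """Return (prefill_sms, decode_sms) for a percentage-based SM split.

    Hypothetical helper: integer math avoids float rounding surprises,
    and each partition is guaranteed at least one SM.
    """
    if total_sms < 2:
        raise ValueError("need at least 2 SMs to form two partitions")
    if not 0 < prefill_percent < 100:
        raise ValueError("prefill_percent must be between 1 and 99")

    prefill_sms = total_sms * prefill_percent // 100
    # Clamp so neither partition ends up with zero SMs.
    prefill_sms = max(1, min(prefill_sms, total_sms - 1))
    return prefill_sms, total_sms - prefill_sms
```

For a 108-SM GPU, `split_sms(108)` returns `(86, 22)`: the fractional SM left over from the 80% share goes to the decode partition because the prefill count is rounded down.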
- Domain: ML Pipeline
- Audience: ML infrastructure engineers optimizing LLM inference in GPU-constrained environments
Generated by Diagrams.so — AI architecture diagram generator with native Draw.io output. Fork this diagram, remix it, or download as .drawio, PNG, or SVG.