Single-GPU LLM Prefill-Decode Disaggregation
About This Architecture
Single-GPU LLM prefill-decode disaggregation partitions one GPU into two specialized zones: 80% of streaming multiprocessor (SM) capacity for prompt processing (prefill) and 20% for token generation (decode). The partition is enforced with libvgpu and libsmctl. Requests flow through AWS API Gateway, WAF, and ALB to vLLM prefill and decode instances, which share model weights through a memory cache and exchange KV cache over UCX/NIXL transport. Because prefill is compute-intensive and decode is memory-bound, running the two in parallel on the same GPU addresses the latency-throughput tradeoff and is intended to improve overall LLM serving efficiency. Fork and customize this diagram on Diagrams.so to adapt GPU partitioning ratios, transport protocols, or load-balancing strategies for your inference workload. Consider tuning the SM zone ratio to your prompt length distribution and batch size requirements.
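To make the 80/20 split concrete, here is a minimal Python sketch that divides a GPU's SMs into a prefill zone and a decode zone and derives one bitmask per zone. The names `SmZones`, `split_sms`, and `zone_mask` are hypothetical helpers written for this illustration. The source does not describe the mask format or calls that libvgpu and libsmctl expect, so treat the bitmask as illustrative only; a real masking library may work at a coarser granularity or use a different bit convention.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SmZones:
    """Contiguous SM index ranges for the two zones (hypothetical layout)."""
    prefill: range
    decode: range


def split_sms(total_sms: int, prefill_fraction: float = 0.8) -> SmZones:
    """Split SM indices 0..total_sms-1 into a prefill zone and a decode zone."""
    if total_sms < 2:
        raise ValueError("need at least 2 SMs to form two zones")
    if not 0.0 < prefill_fraction < 1.0:
        raise ValueError("prefill_fraction must be strictly between 0 and 1")
    n_prefill = round(total_sms * prefill_fraction)
    # Guarantee that each zone receives at least one SM.
    n_prefill = min(max(n_prefill, 1), total_sms - 1)
    return SmZones(range(0, n_prefill), range(n_prefill, total_sms))


def zone_mask(zone: range) -> int:
    """Return an integer with one bit set per SM index in the zone."""
    mask = 0
    for sm_index in zone:
        mask |= 1 << sm_index
    return mask


if __name__ == "__main__":
    zones = split_sms(total_sms=100, prefill_fraction=0.8)
    print(len(zones.prefill), len(zones.decode))
    print(hex(zone_mask(zones.prefill)), hex(zone_mask(zones.decode)))
```

Changing `prefill_fraction` is the knob to experiment with when prompts are unusually long or short relative to generated output; the two masks never overlap, so together they cover every SM exactly once.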
People also ask
How can I maximize LLM inference throughput on a single GPU by separating prefill and decode workloads?
This diagram shows how to partition a single GPU into 80% of SM capacity for prefill (prompt processing and KV cache generation) and 20% for decode (token generation) using libvgpu and libsmctl. The prefill and decode instances share model weights through a memory cache and exchange KV cache over UCX/NIXL transport, so both workloads run in parallel to improve overall serving efficiency.
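The prefill-to-decode handoff described in this answer can be pictured as a producer-consumer pipeline. The toy Python sketch below is an assumption-laden illustration, not vLLM or NIXL code: an in-process `queue.Queue` stands in for the UCX/NIXL transport, a plain list stands in for the KV cache, and `prefill_worker`, `decode_worker`, and `serve` are hypothetical names.

```python
import queue
import threading


def prefill_worker(prompts, kv_queue):
    """Process each prompt once and hand its (stand-in) KV cache to decode."""
    for request_id, prompt in prompts:
        tokens = prompt.split()
        # Stand-in KV cache: one entry per prompt token.
        kv_cache = [len(token) for token in tokens]
        kv_queue.put((request_id, kv_cache))
    kv_queue.put(None)  # Sentinel: no more requests.


def decode_worker(kv_queue, results, max_new_tokens):
    """Consume KV caches and generate tokens one step at a time."""
    while True:
        item = kv_queue.get()
        if item is None:
            break
        request_id, kv_cache = item
        generated = []
        for step in range(max_new_tokens):
            generated.append(f"tok{step}")
            kv_cache.append(step)  # Each decoded token extends the KV cache.
        results[request_id] = (len(kv_cache), generated)


def serve(prompts, max_new_tokens=3):
    """Run prefill and decode concurrently, linked only by the KV queue."""
    kv_queue = queue.Queue()
    results = {}
    prefill = threading.Thread(target=prefill_worker, args=(prompts, kv_queue))
    decode = threading.Thread(
        target=decode_worker, args=(kv_queue, results, max_new_tokens)
    )
    prefill.start()
    decode.start()
    prefill.join()
    decode.join()
    return results


if __name__ == "__main__":
    print(serve([("a", "hello world"), ("b", "one two three")], max_new_tokens=2))
```

Decode can start on the first request while prefill is still working through later prompts, which is the overlap the architecture relies on; in the real system the two workers would be separate vLLM instances pinned to their SM zones.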
- Domain: ML Pipeline
- Audience: ML infrastructure engineers optimizing LLM inference throughput on GPU hardware
Generated by Diagrams.so, an AI architecture diagram generator with native Draw.io output.