About This Architecture
Single-GPU LLM prefill-decode disaggregation partitions one GPU into two specialized zones, 80% of SM capacity for prompt processing (prefill) and 20% for token generation (decode), using libvgpu and libsmctl to maximize throughput. Requests flow through AWS API Gateway, WAF, and an ALB to vLLM prefill and decode instances, which share model weights through a memory cache and exchange KV cache over UCX/NIXL transport.

This design addresses the latency-throughput tradeoff by running compute-bound prefill and memory-bandwidth-bound decode in parallel on the same GPU, so neither workload leaves the other's bottleneck resource idle, improving overall LLM serving efficiency.

Fork and customize this diagram on Diagrams.so to adapt the GPU partitioning ratio, transport protocols, or load-balancing strategy to your inference workload. Consider adjusting the SM zone ratio to match your prompt-length distribution and batch-size requirements.
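As a rough illustration of the two-zone split, the sketch below models it with NVIDIA MPS's documented `CUDA_MPS_ACTIVE_THREAD_PERCENTAGE` variable, which caps the fraction of SMs a client process may occupy; it is a stand-in for the libvgpu/libsmctl mechanism in the diagram, whose APIs are not shown here. The `zone_env` helper, the 80/20 percentages, and the single-GPU device index are illustrative assumptions, not part of the diagram's tooling.

```python
def zone_env(role: str, sm_percent: int) -> dict:
    """Environment for one vLLM instance pinned to a slice of the GPU.

    Under NVIDIA MPS, CUDA_MPS_ACTIVE_THREAD_PERCENTAGE limits the
    process to roughly sm_percent of the GPU's SMs. Hypothetical
    sketch: real deployments would also configure ports, KV-transfer
    settings, etc.
    """
    return {
        "ROLE": role,
        "CUDA_MPS_ACTIVE_THREAD_PERCENTAGE": str(sm_percent),
        "CUDA_VISIBLE_DEVICES": "0",  # both zones share one physical GPU
    }

# 80% of SMs for compute-bound prefill, 20% for memory-bound decode.
prefill_env = zone_env("prefill", 80)
decode_env = zone_env("decode", 20)

for env in (prefill_env, decode_env):
    print(env["ROLE"], env["CUDA_MPS_ACTIVE_THREAD_PERCENTAGE"])
```

When tuning the ratio, the same idea applies: long-prompt, small-batch workloads push the split toward prefill, while short-prompt, high-concurrency workloads benefit from giving decode a larger slice.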