About This Architecture
Single-GPU LLM prefill-decode disaggregation partitions one physical GPU into two vLLM instances: roughly 80% of streaming multiprocessors (SMs) for the compute-bound prefill phase (prompt processing and KV cache generation) and 20% for the latency-bound decode phase (token-by-token output streaming). Requests route through AWS WAF, API Gateway, and an Application Load Balancer to a proxy layer that distributes traffic between the prefill and decode partitions, which share model weights in GPU memory.

KV cache transfer between the partitions uses NIXL/UCX cross-process communication over a dedicated buffer layer, while libvgpu and the SM Zone Controller manage SM allocation. CloudWatch and X-Ray provide end-to-end latency metrics and tracing across the disaggregated pipeline. By keeping the prefill partition busy while decode streams tokens, this design eliminates idle compute during the latency-bound decode phase and maximizes single-GPU throughput, which matters for cost-sensitive inference workloads.

Fork and customize this diagram on Diagrams.so to model your own GPU partitioning strategy or integrate alternative transport mechanisms.
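To make the request flow concrete, the sketch below simulates the handoff the diagram describes: the proxy sends a request to the prefill partition, which writes KV cache into a shared buffer, and the decode partition then pulls that cache and streams tokens. This is a minimal illustrative model, not vLLM code; every class and function name (`KVCacheBuffer`, `PrefillPartition`, `DecodePartition`, `proxy_route`) is hypothetical, and the buffer stands in for the NIXL/UCX transport layer.

```python
"""Illustrative simulation of prefill/decode disaggregation routing.

All names here are hypothetical stand-ins: a real deployment would run two
vLLM instances and move KV cache between them over NIXL/UCX, not Python dicts.
"""
from dataclasses import dataclass, field


@dataclass
class KVCacheBuffer:
    """Stands in for the dedicated KV-cache transport buffer between partitions."""
    entries: dict = field(default_factory=dict)

    def put(self, request_id: str, kv_blocks: list) -> None:
        self.entries[request_id] = kv_blocks

    def get(self, request_id: str) -> list:
        # Decode takes ownership of the cache produced by prefill.
        return self.entries.pop(request_id)


class PrefillPartition:
    """Compute-bound stage (the ~80% SM slice): processes the whole prompt once."""
    def __init__(self, buffer: KVCacheBuffer):
        self.buffer = buffer

    def prefill(self, request_id: str, prompt_tokens: list) -> None:
        # One "KV block" per prompt token, standing in for attention KV state.
        kv_blocks = [f"kv({tok})" for tok in prompt_tokens]
        self.buffer.put(request_id, kv_blocks)


class DecodePartition:
    """Latency-bound stage (the ~20% SM slice): streams output tokens one by one."""
    def __init__(self, buffer: KVCacheBuffer):
        self.buffer = buffer

    def decode(self, request_id: str, max_new_tokens: int):
        kv_blocks = self.buffer.get(request_id)  # pull cache written by prefill
        for step in range(max_new_tokens):
            # Real decode would run attention over kv_blocks; here we only
            # track how the context grows by one KV block per generated token.
            yield f"token_{step}(ctx={len(kv_blocks)})"
            kv_blocks.append(f"kv(token_{step})")


def proxy_route(prompt_tokens: list, max_new_tokens: int = 4) -> list:
    """Proxy layer: route the request through prefill, then stream from decode."""
    buffer = KVCacheBuffer()
    prefill = PrefillPartition(buffer)
    decode = DecodePartition(buffer)
    prefill.prefill("req-1", prompt_tokens)
    return list(decode.decode("req-1", max_new_tokens))


print(proxy_route(["Hello", "world"], max_new_tokens=2))
```

The design point the simulation captures is the decoupling: once prefill deposits the KV cache into the buffer, its SM slice is free to take the next prompt while decode streams tokens, which is what eliminates the idle compute the description mentions.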