About This Architecture
vLLM prefill-decode separation on a single GPU partitions the card into two isolated compute zones using libvgpu and libsmctl, allocating roughly 80% of the streaming multiprocessors (SMs) to prefill and 20% to decode. Requests flow through a proxy router to the prefill instance for prompt processing and KV cache generation; the KV cache then transfers over NIXL/UCX to the decode instance for token generation. Both instances share model weights from a unified memory cache.

Running prefill and decode concurrently on separate partitions keeps the GPU saturated, reducing latency and increasing throughput for LLM serving. The 80/20 split reflects typical prefill-decode compute ratios; adjust it to your workload's prompt lengths and batch sizes. Fork this diagram on Diagrams.so to customize partition ratios, swap transport protocols, or adapt it for multi-GPU clusters.
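The flow above can be sketched as a two-instance launch script. This is a minimal sketch, not a drop-in setup: the `--kv-transfer-config` flag and connector names follow vLLM's disaggregated-prefill examples but may differ across versions, the model name and ports are placeholders, and `CUDA_MPS_ACTIVE_THREAD_PERCENTAGE` (NVIDIA MPS) is used as an illustrative stand-in for the libvgpu/libsmctl SM partitioning shown in the diagram.

```shell
#!/usr/bin/env bash
# Sketch: single-GPU prefill/decode split with an ~80/20 SM partition.
# MPS active-thread percentage stands in for libvgpu/libsmctl here.

MODEL="meta-llama/Llama-3.1-8B-Instruct"   # placeholder model

# Prefill instance: ~80% of SMs, processes prompts and produces KV cache.
CUDA_MPS_ACTIVE_THREAD_PERCENTAGE=80 \
vllm serve "$MODEL" --port 8100 \
  --kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_producer"}' &

# Decode instance: ~20% of SMs, consumes the transferred KV cache
# and generates tokens.
CUDA_MPS_ACTIVE_THREAD_PERCENTAGE=20 \
vllm serve "$MODEL" --port 8200 \
  --kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_consumer"}' &

# A proxy router (port is an assumption) sends each request to the
# prefill instance first, then to the decode instance once the KV
# cache has moved over NIXL/UCX.
```

Both processes load the same weights, so on a setup with a shared weight cache the second launch avoids duplicating model memory; tune the 80/20 percentages to match your prompt-length and batch-size profile.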