vLLM Single GPU PD Separation Architecture
About This Architecture
This architecture runs vLLM with prefill-decode (PD) separation on a single GPU. libvgpu and libsmctl partition the GPU into two isolated compute zones, allocating ~80% of the streaming multiprocessors (SMs) to prefill and ~20% to decode. Requests flow through a proxy router to the prefill instance, which processes the prompt and generates the KV cache; the KV cache then transfers over NIXL/UCX transport to the decode instance, which generates tokens. Both instances share model weights from a unified memory cache. Running prefill and decode concurrently on separate partitions is intended to raise GPU utilization, reduce latency, and increase throughput for LLM serving. The 80/20 split reflects typical prefill-decode compute ratios; adjust it based on your workload's prompt length and batch size characteristics. Fork this diagram on Diagrams.so to customize partition ratios, swap transport protocols, or adapt it for multi-GPU clusters.
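To make the ~80/20 split concrete, the arithmetic can be sketched as below. This is only an illustration of the ratio: `split_sms` is a hypothetical helper, it does not call libvgpu or libsmctl, and the SM counts in the usage note are example inputs rather than values taken from the diagram.

```python
def split_sms(total_sms: int, prefill_fraction: float = 0.8) -> tuple[int, int]:
    """Return (prefill_sms, decode_sms) for a GPU with total_sms SMs.

    Hypothetical helper for illustration only; it performs the ratio
    arithmetic and does not configure any GPU partitioning library.
    """
    if total_sms < 2:
        raise ValueError("need at least 2 SMs to form two partitions")
    if not 0.0 < prefill_fraction < 1.0:
        raise ValueError("prefill_fraction must be strictly between 0 and 1")
    # Round to the nearest whole SM, but keep at least one SM on each side.
    prefill = min(max(round(total_sms * prefill_fraction), 1), total_sms - 1)
    return prefill, total_sms - prefill


if __name__ == "__main__":
    # Example input: 132 SMs (the SM count of an H100 SXM).
    print(split_sms(132))
```

For a 132-SM device the default fraction gives 106 SMs to prefill and 26 to decode. Workloads with shorter prompts or larger decode batches might call for a lower `prefill_fraction`.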
People also ask
How can I partition a single GPU to run LLM prefill and decode concurrently with vLLM?
This diagram shows vLLM prefill-decode separation on a single GPU: libvgpu and libsmctl allocate ~80% of the GPU's streaming multiprocessors to prefill (prompt processing, KV cache generation) and ~20% to decode (token generation). Both instances share model weights via a unified memory cache and transfer the KV cache through NIXL/UCX transport, enabling concurrent execution and higher throughput.
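The request path described above (proxy router, then prefill, then KV cache transfer, then decode) can be sketched as a toy simulation. Every class and function name here is hypothetical and none of it is vLLM, NIXL, or UCX API; the KV cache is a plain record with no tensors, and the "transfer" is an in-process copy standing in for the real transport.

```python
from dataclasses import dataclass, field


@dataclass
class KVCache:
    """Stand-in for the KV cache produced by prefill (no real tensors)."""
    request_id: str
    prompt_len: int


class PrefillInstance:
    """Hypothetical prefill worker: processes the whole prompt once."""

    def run(self, request_id: str, prompt_tokens: list[int]) -> KVCache:
        return KVCache(request_id=request_id, prompt_len=len(prompt_tokens))


class DecodeInstance:
    """Hypothetical decode worker: emits tokens from a received KV cache."""

    def run(self, kv: KVCache, max_new_tokens: int) -> list[int]:
        # Placeholder token ids; a real decoder would sample from the model.
        return [kv.prompt_len + i for i in range(max_new_tokens)]


def transfer_kv(kv: KVCache) -> KVCache:
    """Simulated hand-off; the diagram uses NIXL/UCX transport for this step."""
    return KVCache(request_id=kv.request_id, prompt_len=kv.prompt_len)


@dataclass
class ProxyRouter:
    """Hypothetical router that sends each request through both stages."""
    prefill: PrefillInstance
    decode: DecodeInstance
    trace: list[str] = field(default_factory=list)

    def handle(self, request_id: str, prompt_tokens: list[int],
               max_new_tokens: int) -> list[int]:
        self.trace.append("prefill")
        kv = self.prefill.run(request_id, prompt_tokens)
        self.trace.append("kv_transfer")
        kv = transfer_kv(kv)
        self.trace.append("decode")
        return self.decode.run(kv, max_new_tokens)


if __name__ == "__main__":
    router = ProxyRouter(PrefillInstance(), DecodeInstance())
    print(router.handle("req-1", [11, 12, 13, 14], max_new_tokens=3))
    print(router.trace)
```

The sketch runs the stages sequentially for one request. In the architecture shown, the two instances run concurrently on their own SM partitions, so prefill for one request can overlap with decode for another.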
- Domain: ML Pipeline
- Audience: ML infrastructure engineers optimizing LLM inference throughput on GPU clusters
Generated by Diagrams.so, an AI architecture diagram generator with native Draw.io output.