vLLM Single GPU PD Separation Architecture


About This Architecture

vLLM single-GPU prefill-decode separation partitions one GPU into two isolated compute zones using libvgpu and libsmctl, allocating ~80% of the streaming multiprocessors to prefill and ~20% to decode. Requests flow through a proxy router to the prefill instance for prompt processing and KV cache generation, then transfer via NIXL/UCX transport to the decode instance for token generation; both instances share model weights through a unified memory cache. This architecture maximizes GPU utilization by running prefill and decode workloads concurrently on separate partitions, reducing latency and increasing throughput for LLM serving. Fork this diagram on Diagrams.so to customize partition ratios, swap transport protocols, or adapt it for multi-GPU clusters. The 80/20 split reflects typical prefill-decode compute ratios; adjust it based on your workload's prompt length and batch size characteristics.
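The two-instance layout above can be sketched as a small launcher helper. This is an illustrative sketch only: the SM-limit environment variable (`CUDA_DEVICE_SM_LIMIT`, as used by HAMi-style libvgpu interposers), the ports, and the model name are assumptions, and you should check them against your libvgpu/libsmctl setup and vLLM version before use.

```python
# Hedged sketch: build the environment and command line for each of the
# two vLLM instances that share one GPU with an 80/20 SM partition.
import json


def make_launch(role: str, sm_percent: int, port: int,
                model: str = "meta-llama/Llama-3.1-8B-Instruct"):
    """Return (env, argv) for one vLLM instance pinned to a fraction of SMs.

    role: "prefill" (KV producer) or "decode" (KV consumer).
    """
    kv_cfg = {
        # NIXL-backed KV-cache transfer between the two instances.
        "kv_connector": "NixlConnector",
        "kv_role": "kv_producer" if role == "prefill" else "kv_consumer",
    }
    env = {
        "CUDA_VISIBLE_DEVICES": "0",              # both instances on GPU 0
        "CUDA_DEVICE_SM_LIMIT": str(sm_percent),  # assumed libvgpu SM cap
    }
    argv = [
        "vllm", "serve", model,
        "--port", str(port),
        "--kv-transfer-config", json.dumps(kv_cfg),
    ]
    return env, argv


# ~80% of SMs to prefill, ~20% to decode, per the diagram.
prefill = make_launch("prefill", 80, 8100)
decode = make_launch("decode", 20, 8200)
```

In a real deployment these (env, argv) pairs would be passed to `subprocess.Popen` (or a process supervisor), with the libvgpu shim loaded via `LD_PRELOAD` so the SM limit is enforced.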

People also ask

How can I partition a single GPU to run LLM prefill and decode concurrently with vLLM?

This diagram shows vLLM's prefill-decode separation using libvgpu and libsmctl to allocate ~80% of GPU streaming multiprocessors to prefill (prompt processing, KV generation) and ~20% to decode (token generation). Both instances share model weights via unified memory cache and transfer KV cache through NIXL/UCX transport, enabling concurrent execution and higher throughput.


AWS · advanced · vLLM · LLM inference · GPU partitioning · prefill-decode · UCX transport
Domain: ML Pipeline
Audience: ML infrastructure engineers optimizing LLM inference throughput on GPU clusters

Created by

March 12, 2026

Updated

March 12, 2026 at 10:05 AM

Type

network
