Single GPU LLM Prefill-Decode Disaggregation


About This Architecture

Single GPU LLM prefill-decode disaggregation partitions a physical GPU into two vLLM instances—80% for prompt processing and KV cache generation, 20% for token streaming and output generation. Requests route through AWS WAF, API Gateway, and ALB to a proxy layer that distributes traffic to prefill and decode partitions sharing model weights in memory. KV cache transport between partitions uses NIXL/UCX cross-process communication over a dedicated buffer layer, while libvgpu and SM Zone Controller manage GPU streaming multiprocessor allocation. CloudWatch and X-Ray provide end-to-end observability of latency and tracing across the disaggregated pipeline. This architecture maximizes single-GPU throughput by eliminating idle compute during latency-bound decode phases, critical for cost-sensitive inference workloads. Fork and customize this diagram on Diagrams.so to model your own GPU partitioning strategy or integrate alternative transport mechanisms.
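The proxy layer's two-stage flow can be sketched in a few lines. This is a minimal, hypothetical sketch of the routing logic only: `DisaggProxy`, `PrefillResult`, and the `kv_handle` field are illustrative names, and a real deployment would call the two vLLM instances over HTTP, with KV cache blocks moved by NIXL/UCX rather than by value.

```python
from dataclasses import dataclass
from typing import Callable, Iterator

@dataclass
class PrefillResult:
    kv_handle: str    # opaque reference to KV cache blocks in the shared buffer
    first_token: str  # prefill emits the first generated token

class DisaggProxy:
    """Routes each request through prefill, then streams from decode."""

    def __init__(self,
                 prefill: Callable[[str], PrefillResult],
                 decode: Callable[[str, int], Iterator[str]]):
        self.prefill = prefill
        self.decode = decode

    def generate(self, prompt: str, max_new_tokens: int = 16) -> Iterator[str]:
        # Stage 1: the prefill partition processes the full prompt and
        # materializes the KV cache; only a handle crosses the process
        # boundary (the heavy transfer happens over NIXL/UCX).
        result = self.prefill(prompt)
        yield result.first_token
        # Stage 2: the decode partition streams the remaining tokens
        # against the transferred KV cache.
        yield from self.decode(result.kv_handle, max_new_tokens - 1)
```

With stub prefill/decode callables this runs as plain Python, which makes the routing logic easy to unit-test before wiring in real endpoints.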

People also ask

How do you maximize single GPU throughput for LLM inference by disaggregating prefill and decode workloads?

This diagram shows GPU prefill-decode disaggregation: partition one physical GPU into 80% of SMs for vLLM prefill (prompt processing, KV cache generation) and 20% of SMs for vLLM decode (token streaming). Use libvgpu and SM Zone Controller for partitioning, NIXL/UCX for KV cache transport between partitions, and AWS WAF, API Gateway, and ALB for request routing. This eliminates idle GPU compute during latency-bound decode phases.
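As a point of comparison for the 80/20 SM split, one widely used mechanism for restricting a process to a fraction of a GPU's SMs is CUDA MPS via the `CUDA_MPS_ACTIVE_THREAD_PERCENTAGE` environment variable. The sketch below uses that mechanism for illustration; the diagram's libvgpu/SM Zone Controller stack may expose a different interface, and the vLLM launch command is shown only as an assumption of a standard OpenAI-compatible server.

```python
import os
import subprocess

def partition_env(sm_percent: int) -> dict:
    """Environment for a process limited to sm_percent of the GPU's SMs
    (requires the CUDA MPS control daemon to be running on the host)."""
    env = dict(os.environ)
    env["CUDA_MPS_ACTIVE_THREAD_PERCENTAGE"] = str(sm_percent)
    return env

def launch_vllm(port: int, sm_percent: int) -> subprocess.Popen:
    # Hypothetical launch of a vLLM OpenAI-compatible server inside the
    # SM-limited environment; real flags depend on your vLLM version.
    cmd = ["python", "-m", "vllm.entrypoints.openai.api_server",
           "--port", str(port)]
    return subprocess.Popen(cmd, env=partition_env(sm_percent))

# launch_vllm(8100, 80)  # prefill partition: 80% of SMs
# launch_vllm(8200, 20)  # decode partition: 20% of SMs
```

Both partitions see the same physical device, which is what lets them share model weights in GPU memory while their SM budgets stay isolated.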

Single GPU LLM Prefill-Decode Disaggregation

Tags: AWS · advanced · LLM inference · GPU partitioning · vLLM · ML infrastructure · cost optimization
Domain: ML Pipeline
Audience: ML infrastructure engineers optimizing LLM inference on GPU-constrained environments

Created

March 11, 2026

Updated

March 13, 2026 at 2:02 AM

Type

architecture
