Single-GPU LLM Prefill-Decode Disaggregation


About This Architecture

Single-GPU LLM prefill-decode disaggregation partitions a GPU into two specialized zones—80% SM capacity for prompt processing and 20% for token generation—using libvgpu and libsmctl to maximize throughput. Requests flow through AWS API Gateway, WAF, and ALB to vLLM prefill and decode instances, which share model weights via memory cache and exchange KV cache through UCX/NIXL transport. This architecture solves the latency-throughput tradeoff by running compute-intensive prefill and memory-bound decode workloads in parallel on the same GPU, improving overall LLM serving efficiency. Fork and customize this diagram on Diagrams.so to adapt GPU partitioning ratios, transport protocols, or load balancing strategies for your inference workload. Consider adjusting SM zone ratios based on your prompt length distribution and batch size requirements.
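As a rough illustration of the sizing decision described above (not the libvgpu/libsmctl interface, which the diagram does not specify), a small helper can translate a prefill:decode ratio into concrete SM counts for a given GPU. The 132-SM figure is an assumption matching an NVIDIA H100 SXM; adjust for your hardware.

```python
# Hypothetical helper: turn a prefill/decode partition ratio into per-zone
# SM counts. This sketches the sizing arithmetic only; it is not the
# libvgpu/libsmctl API from the diagram.

def split_sms(total_sms: int, prefill_ratio: float) -> tuple[int, int]:
    """Return (prefill_sms, decode_sms) for the given partition ratio."""
    if not 0.0 < prefill_ratio < 1.0:
        raise ValueError("prefill_ratio must be strictly between 0 and 1")
    prefill = int(total_sms * prefill_ratio)  # round down for the prefill zone
    return prefill, total_sms - prefill       # decode zone gets the remainder

# The diagram's 80/20 split on an H100's 132 SMs:
print(split_sms(132, 0.80))  # -> (105, 27)
```

Re-running this with your own SM count makes it easy to see how the ratio maps onto hardware when tuning for prompt length distribution and batch size.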

People also ask

How can I maximize LLM inference throughput on a single GPU by separating prefill and decode workloads?

This diagram shows how to partition a single GPU into 80% SM capacity for prefill (prompt processing and KV cache generation) and 20% for decode (token generation) using libvgpu and libsmctl. The prefill and decode instances share model weights via memory cache and exchange KV cache through UCX/NIXL transport, running both workloads in parallel to improve overall serving efficiency.
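One way to approximate this split without the libvgpu/libsmctl stack shown in the diagram is NVIDIA's CUDA MPS, whose `CUDA_MPS_ACTIVE_THREAD_PERCENTAGE` variable caps the SM fraction available to a client process. The sketch below launches two vLLM servers under that cap; the model name, ports, and KV-transfer flags are illustrative assumptions and follow vLLM's disaggregated-prefill feature, whose option names may differ across versions.

```shell
# Sketch: approximate the 80/20 SM partition with CUDA MPS instead of
# libvgpu/libsmctl. Model, ports, and KV-transfer settings are
# illustrative; check your vLLM version's disaggregated-prefill docs.

nvidia-cuda-mps-control -d   # start the MPS daemon for this GPU

# Prefill instance: ~80% of SMs, produces the KV cache.
CUDA_VISIBLE_DEVICES=0 CUDA_MPS_ACTIVE_THREAD_PERCENTAGE=80 \
  vllm serve meta-llama/Llama-3.1-8B-Instruct --port 8100 \
  --kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_producer"}' &

# Decode instance: ~20% of SMs, consumes the KV cache and generates tokens.
CUDA_VISIBLE_DEVICES=0 CUDA_MPS_ACTIVE_THREAD_PERCENTAGE=20 \
  vllm serve meta-llama/Llama-3.1-8B-Instruct --port 8200 \
  --kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_consumer"}' &
```

MPS percentages are soft caps rather than hard SM pinning, so expect somewhat looser isolation than a dedicated SM-partitioning library provides.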


Tags: AWS, advanced, LLM inference, GPU optimization, vLLM, ML infrastructure, performance tuning
Domain: ML Pipeline
Audience: ML infrastructure engineers optimizing LLM inference throughput on GPU hardware

Created: March 11, 2026
Updated: March 11, 2026 at 10:46 AM
Type: architecture
