Single-GPU LLM Prefill-Decode Disaggregation

AWS, Architecture, advanced
[Diagram: Single-GPU LLM Prefill-Decode Disaggregation, AWS architecture]

About This Architecture

Single-GPU LLM prefill-decode disaggregation partitions one GPU into two specialized zones, with 80% of streaming multiprocessor (SM) capacity for prompt processing (prefill) and 20% for token generation (decode), using libvgpu and libsmctl to set up the split. Requests flow through AWS API Gateway, WAF, and an ALB to separate vLLM prefill and decode instances, which share model weights via a memory cache and exchange KV cache through UCX/NIXL transport. The architecture targets the latency-throughput tradeoff: compute-intensive prefill and memory-bound decode run in parallel on the same GPU instead of competing for the same SMs, which is intended to improve overall LLM serving efficiency. Fork and customize this diagram on Diagrams.so to adapt GPU partitioning ratios, transport protocols, or load balancing strategies for your inference workload. In particular, consider adjusting the SM zone ratio to match your prompt length distribution and batch size requirements.
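
To make the 80/20 split concrete, the arithmetic can be sketched as below. This is only an illustration under stated assumptions: the page does not name a GPU, so the example assumes 132 SMs (the count on an NVIDIA H100 SXM), and `split_sms` is a hypothetical helper, not part of libvgpu or libsmctl.

```python
import math


def split_sms(total_sms: int, prefill_fraction: float = 0.8) -> tuple[int, int]:
    """Return (prefill_sms, decode_sms) for a given SM count and prefill share.

    Hypothetical helper for illustration; it only does the arithmetic and
    does not talk to any GPU partitioning library.
    """
    if total_sms < 2:
        raise ValueError("need at least 2 SMs to form two zones")
    if not 0.0 < prefill_fraction < 1.0:
        raise ValueError("prefill_fraction must be strictly between 0 and 1")
    # Round the prefill zone down, then clamp so each zone keeps at least one SM.
    prefill = min(max(math.floor(total_sms * prefill_fraction), 1), total_sms - 1)
    return prefill, total_sms - prefill


if __name__ == "__main__":
    # Assumed SM count of 132; try a few candidate ratios.
    for fraction in (0.8, 0.7, 0.5):
        print(fraction, split_sms(132, fraction))
```

Sweeping `prefill_fraction` like this is one simple way to enumerate candidate zone sizes before measuring them against a real prompt length distribution.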

People also ask

How can I maximize LLM inference throughput on a single GPU by separating prefill and decode workloads?

This diagram shows how to partition a single GPU so that 80% of its SM capacity serves prefill (prompt processing and KV cache generation) and 20% serves decode (token generation), using libvgpu and libsmctl. The prefill and decode instances share model weights via a memory cache and exchange KV cache through UCX/NIXL transport, so both workloads run in parallel to improve overall serving efficiency.
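
The prefill-to-decode handoff described above can be sketched as a toy model. Everything here is an assumption for illustration: `KVCache`, `prefill`, `decode`, and `serve` are hypothetical names, a plain in-process queue stands in for the UCX/NIXL transport, and a deterministic formula stands in for model sampling. It does not use vLLM or any GPU.

```python
from dataclasses import dataclass, field
from queue import Queue


@dataclass
class KVCache:
    """Toy KV cache: one integer entry per processed token."""
    request_id: str
    entries: list = field(default_factory=list)


def prefill(request_id, prompt_tokens):
    # Prefill processes the whole prompt in one pass and emits its KV cache.
    return KVCache(request_id, list(prompt_tokens))


def decode(cache, max_new_tokens):
    # Decode produces one token per step and extends the KV cache each time.
    generated = []
    for _ in range(max_new_tokens):
        # Deterministic stand-in for sampling the next token from a model.
        next_token = (sum(cache.entries) + len(cache.entries)) % 1000
        cache.entries.append(next_token)
        generated.append(next_token)
    return generated


def serve(prompts, max_new_tokens=3):
    transport = Queue()  # stand-in for the KV cache transport between zones
    # Prefill side: build a KV cache per request and hand it off.
    for request_id, tokens in prompts.items():
        transport.put(prefill(request_id, tokens))
    # Decode side: pick up each KV cache and generate tokens from it.
    results = {}
    while not transport.empty():
        cache = transport.get()
        results[cache.request_id] = decode(cache, max_new_tokens)
    return results


if __name__ == "__main__":
    print(serve({"req-1": [1, 2, 3], "req-2": [10, 20]}, max_new_tokens=2))
```

The point of the sketch is the shape of the data flow: prefill touches the full prompt once, while decode repeatedly reads and extends the transferred cache, which is why the two stages stress the GPU differently.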

Tags: AWS, LLM inference, GPU optimization, vLLM, ML infrastructure, performance tuning
Domain: ML Pipeline
Audience: ML infrastructure engineers optimizing LLM inference throughput on GPU hardware

Generated by Diagrams.so, an AI architecture diagram generator with native Draw.io output. The diagram can be forked, remixed, or downloaded as .drawio, PNG, or SVG.


Created: March 11, 2026
Updated: May 7, 2026 at 8:45 PM
Type: architecture
