About This Architecture
The weight-reuse and shared-cache architecture optimizes GPU memory utilization by intercepting model parameters, serializing weights to disk and exporting IPC handles to them, and aligning the weights in physical memory for zero-copy access across prefill and decode instances. CPU orchestration coordinates three phases: parameter interception, serialization with distinct save and load modes, and physical memory alignment, so that multiple GPU instances can access identical weights without duplication. By leveraging memory-mapped IPC paths and a persistent weight cache, this pattern reduces memory footprint and improves inference throughput for large language models. The placeholder load mode, which registers zero-dimensional tensors in place of full allocations, avoids redundant memory during decode, making the design well suited to batched inference workloads.

Fork this diagram on Diagrams.so to customize the architecture for your inference serving stack, adjust phase timing, or integrate it with your orchestration layer.
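The save/load split described above can be sketched with POSIX shared memory as a stdlib stand-in for CUDA IPC handles: one process serializes the weights into a named segment once (save mode), and other processes attach to the same segment by name and read it without copying (load mode). The function names, segment name, and double-precision layout here are illustrative assumptions, not part of the diagrammed system:

```python
# Minimal sketch of the save/load weight-cache pattern, assuming a named
# shared-memory segment stands in for a CUDA IPC handle.
import struct
from multiprocessing import shared_memory

def save_weights(name, weights):
    """Save mode (hypothetical): serialize weights once into a named segment."""
    payload = struct.pack(f"{len(weights)}d", *weights)
    shm = shared_memory.SharedMemory(name=name, create=True, size=len(payload))
    shm.buf[: len(payload)] = payload
    return shm  # keep this handle alive so the segment persists

def load_weights(name, count):
    """Load mode (hypothetical): attach by name; the view is zero-copy."""
    shm = shared_memory.SharedMemory(name=name)
    view = memoryview(shm.buf)[: count * 8].cast("d")  # 8 bytes per double
    return shm, view

# Writer ("prefill") publishes the weights; reader ("decode") attaches.
writer = save_weights("demo_weight_cache", [0.5, 1.5, 2.5])
reader, view = load_weights("demo_weight_cache", 3)
result = list(view)
print(result)  # → [0.5, 1.5, 2.5]

# Release the view before closing, then unlink the segment once.
view.release()
reader.close()
writer.close()
writer.unlink()
```

In the real architecture the segment would hold GPU weights exported through IPC handles rather than host doubles, but the lifecycle is the same: exactly one save, many zero-copy loads, and a single owner responsible for unlinking.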