About This Architecture
CUDA IPC memory sharing lets two PyTorch processes, a prefill engine and a decode engine, share GPU tensors without duplication via CUDA inter-process communication handles. Process A allocates the KV cache and model weights on the GPU, extracts CUDA IPC handles using torch.multiprocessing, and serializes the metadata (handle_dict.pkl, tensor_meta.pkl, layer_map.pkl) to a shared filesystem. Process B deserializes these handles, rebuilds the GPU pointers via a model-init interceptor, and maps its activation buffers onto the same shared VRAM, eliminating redundant allocations. Read-only weight sharing and zero-copy tensor access across process boundaries reduce both memory footprint and latency in speculative-decoding and batch-inference workloads. Fork this diagram on Diagrams.so to customize the IPC transport paths, add monitoring hooks, or adapt it for multi-GPU scenarios. The pattern matters for production LLM serving, where memory constraints and inference speed directly bound throughput.
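As a minimal sketch of the metadata hand-off step (the function names, file layout, and placeholder handle bytes below are illustrative assumptions, not the diagram's actual code), Process A can pickle the three metadata files to the shared filesystem and Process B can read them back. In a real implementation the opaque handle bytes would come from PyTorch's CUDA IPC machinery (e.g. `torch.multiprocessing.reductions.reduce_tensor`) rather than placeholders:

```python
import pickle
import tempfile
from pathlib import Path

def export_metadata(share_dir: Path) -> None:
    """Process A side: write handle/shape/layer metadata to the shared dir.

    The handle bytes here are placeholders; in practice they would be the
    opaque cudaIpcMemHandle_t payloads extracted for each shared tensor.
    """
    handle_dict = {"kv_cache": b"\x00" * 64, "weights": b"\x01" * 64}
    tensor_meta = {"kv_cache": {"shape": (32, 2, 4096), "dtype": "float16"}}
    layer_map = {i: f"layers.{i}" for i in range(2)}
    for name, payload in [("handle_dict", handle_dict),
                          ("tensor_meta", tensor_meta),
                          ("layer_map", layer_map)]:
        with open(share_dir / f"{name}.pkl", "wb") as f:
            pickle.dump(payload, f)

def import_metadata(share_dir: Path) -> dict:
    """Process B side: reload the three metadata files before rebuilding
    GPU pointers from the IPC handles."""
    out = {}
    for name in ("handle_dict", "tensor_meta", "layer_map"):
        with open(share_dir / f"{name}.pkl", "rb") as f:
            out[name] = pickle.load(f)
    return out

# Round-trip through a temporary "shared filesystem" directory.
share_dir = Path(tempfile.mkdtemp())
export_metadata(share_dir)
meta = import_metadata(share_dir)
print(sorted(meta))  # → ['handle_dict', 'layer_map', 'tensor_meta']
```

Process B would then pass each entry of `handle_dict` plus its `tensor_meta` shape and dtype to the handle-rebuild step; only the metadata crosses the filesystem, while the tensor data itself stays in shared VRAM.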