Deep Learning CUDA IPC Memory Sharing
About This Architecture
Deep learning CUDA IPC memory sharing lets two PyTorch processes, a prefill engine and a decode engine, share GPU tensors without duplicating them by exchanging CUDA inter-process communication (IPC) handles. Process A allocates the KV cache and model weights on the GPU, extracts CUDA IPC handles using torch.multiprocessing, and serializes the metadata (handle_dict.pkl, tensor_meta.pkl, layer_map.pkl) to a shared filesystem. Process B deserializes these handles, rebuilds the GPU pointers via a model init interceptor, and maps its activation buffers onto the same shared VRAM, so nothing is allocated twice. Both processes must run on the same host and see the same GPU, and Process A must keep its allocations alive for as long as Process B uses them. Because weights are shared read-only and tensors are accessed zero-copy across the process boundary, the architecture reduces memory footprint and latency in speculative decoding and batch inference workloads. Fork this diagram on Diagrams.so to customize IPC transport paths, add monitoring hooks, or adapt it for multi-GPU scenarios. The pattern suits production LLM serving, where memory constraints and inference speed directly affect throughput.
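The hand-off through the shared filesystem can be sketched with the standard library alone. The file names come from the description above; everything else is an assumption made for illustration: the contents of each file (handle_dict.pkl maps tensor name to an opaque handle payload, tensor_meta.pkl maps tensor name to shape, dtype, and strides, layer_map.pkl maps layer index to tensor names), the helper names, and the choice to write handle_dict.pkl last as a readiness signal.

```python
import os
import pickle
import tempfile
from pathlib import Path


def contiguous_strides(shape):
    """Row-major strides, in elements, for a contiguous tensor of this shape."""
    strides, step = [], 1
    for dim in reversed(shape):
        strides.append(step)
        step *= dim
    return tuple(reversed(strides))


def write_atomic(path, obj):
    """Pickle to a temp file, then rename, so a reader never sees a partial file."""
    path = Path(path)
    fd, tmp = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    with os.fdopen(fd, "wb") as f:
        pickle.dump(obj, f)
    os.replace(tmp, path)


def publish(share_dir, handles, metas, layer_map):
    """Process A: write the three metadata files (assumed layout, see lead-in)."""
    share_dir = Path(share_dir)
    write_atomic(share_dir / "tensor_meta.pkl", metas)
    write_atomic(share_dir / "layer_map.pkl", layer_map)
    # Written last: its presence tells Process B the other two files are complete.
    write_atomic(share_dir / "handle_dict.pkl", handles)


def load_manifest(share_dir):
    """Process B: read the files back and check they describe the same tensors."""
    share_dir = Path(share_dir)
    loaded = []
    for name in ("handle_dict.pkl", "tensor_meta.pkl", "layer_map.pkl"):
        # pickle.load can run arbitrary code; only read a directory you trust.
        with open(share_dir / name, "rb") as f:
            loaded.append(pickle.load(f))
    handles, metas, layer_map = loaded
    missing = sorted(set(metas) - set(handles))
    if missing:
        raise KeyError(f"no IPC handle for tensors: {missing}")
    return handles, metas, layer_map
```

A caller would build `metas` with entries such as `{"shape": (2, 8, 128, 64), "dtype": "float16", "strides": contiguous_strides((2, 8, 128, 64))}`. Writing each file to a temporary name and renaming it means a consumer that polls the directory cannot unpickle a half-written file.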
People also ask
How can I share GPU tensors between multiple PyTorch processes without duplicating memory allocations?
Use CUDA IPC handles extracted via torch.multiprocessing: the owning process serializes each GPU tensor's metadata (shape, dtype, strides) and its IPC handle to disk. A second process deserializes these handles, rebuilds the GPU pointers via a model init interceptor, and maps its tensors onto the same shared VRAM. This enables zero-copy access to model weights and KV cache across process boundaries, which is critical for speculative decoding and batch inference workloads.
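One way the handle export and rebuild might look in PyTorch is sketched below. It is not taken from the diagram: `describe`, `export_handles`, and `import_handles` are hypothetical names, and the sketch relies on `reduce_tensor` from `torch.multiprocessing.reductions`, which returns a `(rebuild_fn, args)` pair whose arguments embed the CUDA IPC handle for a CUDA tensor. The layout of that pair is a PyTorch implementation detail, so both processes should run the same PyTorch version. The model init interceptor itself is not shown; it would take the returned tensors and assign them to the model's parameters and cache buffers.

```python
import pickle
from pathlib import Path


def describe(t):
    """Layout metadata for one tensor: shape, dtype, and strides."""
    return {"shape": tuple(t.shape), "dtype": str(t.dtype), "strides": tuple(t.stride())}


def export_handles(named_tensors, share_dir):
    """Owning process: write IPC handles and layout metadata for CUDA tensors.

    The caller must keep every tensor in `named_tensors` alive, and keep this
    process running, for as long as another process uses the shared memory.
    """
    # Imported lazily so this module still loads on a machine without PyTorch.
    from torch.multiprocessing.reductions import reduce_tensor

    share_dir = Path(share_dir)
    handles, metas = {}, {}
    for name, t in named_tensors.items():
        if not t.is_cuda:
            raise ValueError(f"{name} is not a CUDA tensor")
        handles[name] = reduce_tensor(t)  # (rebuild_fn, args) carrying the IPC handle
        metas[name] = describe(t)
    with open(share_dir / "tensor_meta.pkl", "wb") as f:
        pickle.dump(metas, f)
    with open(share_dir / "handle_dict.pkl", "wb") as f:
        pickle.dump(handles, f)


def import_handles(share_dir):
    """Second process: rebuild tensors that alias the owner's VRAM (no copy)."""
    share_dir = Path(share_dir)
    # pickle.load can run arbitrary code; only read a directory you trust.
    with open(share_dir / "handle_dict.pkl", "rb") as f:
        handles = pickle.load(f)
    with open(share_dir / "tensor_meta.pkl", "rb") as f:
        metas = pickle.load(f)
    tensors = {}
    for name, (rebuild_fn, args) in handles.items():
        t = rebuild_fn(*args)  # opens the IPC handle in this process
        if describe(t) != metas[name]:
            raise ValueError(f"layout mismatch for {name}")
        tensors[name] = t
    return tensors
```

In this sketch the owner would call `export_handles(dict(model.named_parameters()), share_dir)` after loading weights, and the second process would call `import_handles(share_dir)` from a separate process on the same host and GPU. The rebuilt tensors share storage with the originals, so a write from either side is visible to both; treating the weights as read-only is a convention the code must uphold, not something CUDA IPC enforces.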
- Domain: ML Pipeline
- Audience: ML engineers optimizing multi-process deep learning inference with CUDA memory sharing
Generated by Diagrams.so, an AI architecture diagram generator with native Draw.io output.