D3-UNet-v5 Semantic Segmentation Architecture
About This Architecture
D3-UNet-v5 combines a CNN encoder with DINOv3 vision transformer features for multi-scale semantic segmentation on high-resolution imagery such as Cityscapes. The architecture processes 1024×2048 input images through four progressive encoder stages (64–512 channels), fuses DINOv3 ViT patch tokens via multi-scale projection, and reconstructs dense predictions through an ASPP bottleneck and a symmetric decoder with skip connections. This hybrid CNN-ViT design leverages self-supervised transformer knowledge to improve segmentation accuracy while maintaining spatial precision across 19 semantic classes. Fork this diagram on Diagrams.so to customize encoder depths or fusion strategies, or to adapt the architecture for your own vision datasets and OCI ML infrastructure. The SE (squeeze-and-excitation) blocks enhance channel-wise feature recalibration, which is critical for balancing multi-scale information in dense prediction tasks.
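As a rough illustration of the channel recalibration the SE blocks perform, here is a minimal PyTorch sketch of a squeeze-and-excitation module. This is not the actual D3-UNet-v5 code; the class name and the `reduction=16` default are assumptions for the example.

```python
import torch
import torch.nn as nn


class SEBlock(nn.Module):
    """Squeeze-and-excitation: global pool -> bottleneck MLP -> per-channel gates.

    Hypothetical sketch of the SE modules described above, not the
    diagram's actual implementation.
    """

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)  # squeeze: B×C×H×W -> B×C×1×1
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),  # per-channel gate in (0, 1)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        gates = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * gates  # excitation: rescale each channel by its gate
```

In a decoder like the one above, one such block would typically follow each fusion or upsampling stage so the network can emphasize whichever channels (CNN or ViT-derived) are most informative at that scale.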
People also ask
How does D3-UNet-v5 combine CNN and vision transformer features for semantic segmentation?
D3-UNet-v5 processes input images through a CNN encoder with four progressive stages while extracting DINOv3 ViT patch tokens in parallel. Multi-scale projection aligns the transformer features to each encoder stage, fuses them via concatenation or addition, and passes the combined representation through an ASPP bottleneck with SE blocks for channel recalibration before symmetric upsampling and skip-connection fusion in the decoder.
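The token-alignment step in the answer above can be sketched in PyTorch: reshape the ViT patch tokens into a feature map, project them to the encoder stage's channel count, resize to that stage's spatial resolution, and fuse by concatenation. The class name, dimensions, and concat-then-1×1-conv merge are assumptions for illustration, not the diagram's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TokenFusion(nn.Module):
    """Hypothetical sketch: align ViT patch tokens to one CNN encoder stage."""

    def __init__(self, vit_dim: int, cnn_channels: int):
        super().__init__()
        # 1×1 conv projects transformer channels to the stage's channel count
        self.proj = nn.Conv2d(vit_dim, cnn_channels, kernel_size=1)
        # merge the concatenated CNN + ViT features back to cnn_channels
        self.merge = nn.Conv2d(2 * cnn_channels, cnn_channels, kernel_size=1)

    def forward(self, cnn_feat: torch.Tensor, tokens: torch.Tensor,
                grid_hw: tuple[int, int]) -> torch.Tensor:
        b, n, d = tokens.shape
        h, w = grid_hw
        # B×N×D patch tokens -> B×D×h×w feature map
        fmap = tokens.transpose(1, 2).reshape(b, d, h, w)
        fmap = self.proj(fmap)
        # upsample to match the encoder stage's spatial resolution
        fmap = F.interpolate(fmap, size=cnn_feat.shape[-2:],
                             mode="bilinear", align_corners=False)
        # fuse by concatenation (the answer also mentions addition as an option)
        return self.merge(torch.cat([cnn_feat, fmap], dim=1))
```

One such module per encoder stage, with `cnn_channels` stepping through the 64–512 range mentioned above, would realize the multi-scale projection the answer describes.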
- Domain: ML Pipeline
- Audience: Machine learning engineers building semantic segmentation models for autonomous driving and scene understanding
Generated by Diagrams.so — AI architecture diagram generator with native Draw.io output. Fork this diagram, remix it, or download as .drawio, PNG, or SVG.