About This Architecture
The Swin Transformer encoder implements a four-stage hierarchical architecture that progressively downsamples the input image while increasing channel depth, using patch merging between stages and shifted-window attention blocks within them. The pipeline begins with patch embedding at H/4 × W/4 × 96, then cascades through Stage 1 (2 blocks, 96 channels), Stage 2 (6 blocks, 192 channels), Stage 3 (18 blocks, 384 channels), and Stage 4 (2 blocks, 768 channels), with skip connections preserving multi-scale feature maps for downstream tasks. Because self-attention is computed within local windows, complexity grows linearly with image size rather than quadratically as in vanilla transformers, while strong feature representations are maintained across spatial scales. This makes the design well suited to dense prediction tasks such as segmentation and object detection, where multi-resolution features are critical.

Fork this diagram on Diagrams.so to customize layer counts and embedding dimensions, or adapt it for your own vision model documentation.
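To make the hierarchy concrete, here is a minimal sketch (plain Python, no deep-learning framework) that computes the feature-map shape at each stage for a given input resolution. The stage depths and channel widths are taken from the description above; the `swin_stage_shapes` helper is hypothetical, and it simply applies the stated rule that patch embedding divides the input by 4 and each patch-merging step halves H and W while doubling channels.

```python
def swin_stage_shapes(height, width, embed_dim=96, depths=(2, 6, 18, 2)):
    """Return [(stage_name, num_blocks, (H, W, C)), ...] for the four stages.

    Patch embedding maps the input to H/4 x W/4 x embed_dim; patch merging
    between stages halves the spatial resolution and doubles the channels.
    """
    h, w, c = height // 4, width // 4, embed_dim  # patch embedding: H/4 x W/4 x 96
    shapes = []
    for i, blocks in enumerate(depths, start=1):
        shapes.append((f"Stage {i}", blocks, (h, w, c)))
        if i < len(depths):  # patch merging between consecutive stages
            h, w, c = h // 2, w // 2, c * 2
    return shapes


# For a 224 x 224 input, this yields the multi-scale pyramid described above:
for name, blocks, shape in swin_stage_shapes(224, 224):
    print(name, blocks, shape)
# Stage 1: (56, 56, 96), Stage 2: (28, 28, 192),
# Stage 3: (14, 14, 384), Stage 4: (7, 7, 768)
```

These four shapes are exactly the multi-scale feature maps that the skip connections expose to a downstream decoder or detection head.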