About This Architecture
The encoder-decoder transformer architecture processes input tokens through positional encoding and stacked encoder layers. Input embeddings flow through multi-head self-attention, add & norm, and feed-forward network blocks before producing the encoder output. The decoder stack consumes shifted output tokens, applies masked multi-head self-attention and cross-attention over the encoder output, then produces output probabilities via a linear layer and softmax. This canonical architecture, introduced for machine translation, demonstrates attention-based sequence modeling without recurrence; its components also underpin modern NLP models such as BERT (encoder-only) and the GPT variants (decoder-only). Fork this diagram on Diagrams.so to customize layer counts, add residual connections, or adapt it for vision transformers and multimodal architectures.
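The core operation in both the encoder's self-attention and the decoder's masked self-attention is scaled dot-product attention. The sketch below is a minimal, dependency-free illustration (not an optimized or batched implementation); the function names and list-of-lists representation are illustrative choices, and the `causal` flag stands in for the decoder's look-ahead mask:

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(x for x in xs if x != float("-inf"))
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(Q, K, V, causal=False):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V.

    Q, K, V are lists of vectors (lists of floats). With causal=True,
    position i is blocked from attending to positions j > i, mirroring
    the decoder's masked multi-head self-attention; masked scores are
    set to -inf so softmax assigns them zero weight.
    """
    d_k = len(K[0])
    out = []
    for i, q in enumerate(Q):
        scores = []
        for j, k in enumerate(K):
            if causal and j > i:
                scores.append(float("-inf"))  # look-ahead mask
            else:
                dot = sum(a * b for a, b in zip(q, k))
                scores.append(dot / math.sqrt(d_k))
        weights = softmax(scores)
        # Each output row is a weight-averaged mix of the value vectors.
        out.append([sum(w * v[d] for w, v in zip(weights, V))
                    for d in range(len(V[0]))])
    return out
```

In the full architecture this single attention head is replicated across multiple heads with learned projections, and cross-attention reuses the same function with Q taken from the decoder and K, V from the encoder output.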