About This Architecture
The encoder-decoder transformer architecture processes input tokens through positional encoding and stacked encoder layers. Input embeddings flow through multi-head self-attention, add & norm, and feed-forward network blocks before producing the encoder output. The decoder stack consumes shifted output tokens, applies masked multi-head self-attention and cross-attention over the encoder output, then produces output probabilities via a linear layer and softmax. This canonical architecture, introduced for machine translation, demonstrates attention-based sequence modeling without recurrence; its components also underpin modern NLP models such as BERT (encoder-only) and the GPT variants (decoder-only). Fork this diagram on Diagrams.so to customize layer counts, add residual connections, or adapt it for vision transformers and multimodal architectures.
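The core operation in both the encoder's self-attention and the decoder's masked self-attention is scaled dot-product attention. The sketch below is a minimal, dependency-free illustration (not an optimized or batched implementation); the function names and list-of-lists representation are illustrative choices, and the `causal` flag stands in for the decoder's look-ahead mask:

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(x for x in xs if x != float("-inf"))
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(Q, K, V, causal=False):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V.

    Q, K, V are lists of vectors (lists of floats). With causal=True,
    position i is blocked from attending to positions j > i, mirroring
    the decoder's masked multi-head self-attention; masked scores are
    set to -inf so softmax assigns them zero weight.
    """
    d_k = len(K[0])
    out = []
    for i, q in enumerate(Q):
        scores = []
        for j, k in enumerate(K):
            if causal and j > i:
                scores.append(float("-inf"))  # look-ahead mask
            else:
                dot = sum(a * b for a, b in zip(q, k))
                scores.append(dot / math.sqrt(d_k))
        weights = softmax(scores)
        # Each output row is a weight-averaged mix of the value vectors.
        out.append([sum(w * v[d] for w, v in zip(weights, V))
                    for d in range(len(V[0]))])
    return out
```

In the full architecture this single attention head is replicated across multiple heads with learned projections, and cross-attention reuses the same function with Q taken from the decoder and K, V from the encoder output.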