Transformer Architecture (Encoder-Decoder)


About This Architecture

This encoder-decoder transformer processes input tokens through an embedding layer, positional encoding, and a stack of encoder layers. Each encoder layer applies multi-head self-attention, an add & norm step, and a feed-forward network before producing the encoder output. The decoder stack consumes shifted output tokens, applies masked multi-head self-attention and cross-attention over the encoder output, then produces output probabilities via a linear layer and softmax. This canonical architecture underpins modern NLP: BERT builds on the encoder half, GPT variants on the decoder half, and machine translation systems use the full encoder-decoder, all modeling sequences with attention rather than recurrence. Fork this diagram on Diagrams.so to customize layer counts, add residual connections, or adapt it for vision transformers and multimodal architectures.
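The attention blocks in the diagram all build on scaled dot-product attention. As a minimal sketch (a NumPy illustration, not the diagram's actual implementation; the shapes and helper names are assumptions for the example):

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax: subtract the row max before exponentiating.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-1, -2) / np.sqrt(d_k)
    if mask is not None:
        # Disallowed positions get a large negative score -> ~zero weight.
        scores = np.where(mask, scores, -1e9)
    weights = softmax(scores)
    return weights @ V, weights

# Toy example: 4 positions, key/query dimension d_k = 8.
rng = np.random.default_rng(0)
Q = rng.standard_normal((4, 8))
K = rng.standard_normal((4, 8))
V = rng.standard_normal((4, 8))
out, w = scaled_dot_product_attention(Q, K, V)
```

In self-attention Q, K, and V all come from the same sequence; in the decoder's cross-attention, Q comes from the decoder while K and V come from the encoder output.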

People also ask

How does the transformer encoder-decoder architecture process input tokens through attention mechanisms?

Input tokens flow through an embedding layer and positional encoding into the encoder stack, where multi-head self-attention and feed-forward layers process them. The decoder stack applies masked self-attention to the shifted outputs and cross-attention over the encoder output, then generates output probabilities via a linear layer and softmax.
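The "masked" in masked self-attention refers to a causal mask: each decoder position may attend only to itself and earlier positions, so the model cannot peek at future tokens during training. A minimal sketch (the helper names are assumptions for illustration):

```python
import numpy as np

def causal_mask(n):
    # Boolean mask: True where attention is allowed.
    # Position i may attend only to positions j <= i (lower triangle).
    return np.tril(np.ones((n, n), dtype=bool))

def apply_mask(scores, mask):
    # Disallowed positions get a large negative score, so a subsequent
    # softmax assigns them approximately zero attention weight.
    return np.where(mask, scores, -1e9)

mask = causal_mask(4)
# Row i has i + 1 allowed positions: 1, 2, 3, 4 for a length-4 sequence.
```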


Tags: advanced · machine-learning · transformer · neural-networks · NLP · attention-mechanism · deep-learning
Domain: ML Pipeline
Audience: machine learning engineers building sequence-to-sequence models

Created

February 17, 2026

Updated

February 25, 2026 at 3:51 PM

Type

architecture
