About This Architecture
This VideoMAE pretraining pipeline performs masked spatio-temporal reconstruction on 16-frame clips sampled at 4-frame intervals from 4K 60fps source video. Raw frames are tokenized into 1,568 spatio-temporal tokens using 16×16 spatial patches and 2-frame temporal tubelets; 90% of the tokens are then randomly masked, leaving only 157 visible tokens for the transformer encoder to process. The encoder outputs 768-dimensional latent embeddings that feed a decoder tasked with reconstructing the masked tokens, so the model learns rich video representations without labeled data. The aggressive 90% masking ratio forces the encoder to capture global spatio-temporal context rather than memorizing local pixel patterns.

Fork this diagram on Diagrams.so to customize frame rates, patch sizes, masking ratios, or model dimensions for your own video foundation model experiments.
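The token arithmetic above can be sketched in a few lines. This is an illustrative sketch, not the official VideoMAE code; it assumes a 224×224 spatial resolution, which is the only resolution consistent with 1,568 tokens given 16×16 patches and 2-frame tubelets:

```python
import numpy as np

# Assumed clip/tokenizer settings from the diagram (224x224 is an assumption).
FRAMES = 16          # frames per sampled clip
TUBELET = 2          # temporal tubelet depth in frames
PATCH = 16           # spatial patch size in pixels
H = W = 224          # assumed input resolution
MASK_RATIO = 0.90    # fraction of tokens hidden from the encoder

tokens_per_frame = (H // PATCH) * (W // PATCH)       # 14 * 14 = 196
temporal_positions = FRAMES // TUBELET               # 16 / 2 = 8
num_tokens = tokens_per_frame * temporal_positions   # 196 * 8 = 1,568

num_masked = int(round(num_tokens * MASK_RATIO))     # 1,411 masked tokens
num_visible = num_tokens - num_masked                # 157 visible tokens

# Random masking: shuffle token indices and split into visible/masked sets.
rng = np.random.default_rng(seed=0)
perm = rng.permutation(num_tokens)
visible_idx = np.sort(perm[:num_visible])   # indices the encoder processes
masked_idx = np.sort(perm[num_visible:])    # indices the decoder reconstructs
```

Only `visible_idx` tokens are embedded and passed through the encoder; the decoder receives the 768-dimensional encoder outputs plus learnable mask tokens at `masked_idx` positions and regresses the original pixel content there.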