VideoMAE Pretraining Pipeline - Physical Load
About This Architecture
The VideoMAE pretraining pipeline performs masked spatio-temporal reconstruction on 16-frame video clips sampled at 4-frame intervals from 4K 60fps source video. Each clip is tokenized into 1,568 spatio-temporal tokens using 16×16 spatial patches and 2-frame temporal tubelets. 90% of the tokens are then randomly masked, leaving only 157 visible tokens for the transformer encoder to process. The encoder outputs 768-dimensional latent embeddings that feed a decoder, which reconstructs the 1,411 masked tokens, so the model learns video representations without labeled data. Masking this aggressively forces the encoder to learn global spatio-temporal context rather than memorize local pixel patterns. Fork this diagram on Diagrams.so to customize frame rates, patch sizes, masking ratios, or model dimensions for your own video foundation model experiments.
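The token counts above can be checked with a few lines of arithmetic. This is a minimal sketch, not the pipeline's actual code: the function names are hypothetical, and the 224×224 input resolution is an assumption (the text does not state it, but it is the size for which 16×16 patches and 2-frame tubelets give 1,568 tokens).

```python
def token_counts(num_frames=16, tubelet=2, image_size=224, patch=16, mask_ratio=0.9):
    """Return (total, visible, masked) token counts for one clip.

    image_size=224 is an assumption; the other defaults come from the text.
    """
    temporal = num_frames // tubelet           # 16 / 2 = 8 tubelet positions
    per_side = image_size // patch             # 224 / 16 = 14 patches per side
    total = temporal * per_side * per_side     # 8 * 14 * 14
    masked = int(mask_ratio * total)           # round down the masked count
    visible = total - masked
    return total, visible, masked


def clip_span(num_frames=16, stride=4):
    """Number of source frames covered by one clip, first to last inclusive."""
    return (num_frames - 1) * stride + 1


total, visible, masked = token_counts()
print(total, visible, masked)   # 1568 157 1411
print(clip_span())              # 61 source frames, about 1 s at 60 fps
```

Changing `mask_ratio`, `patch`, or `tubelet` here shows how quickly the encoder's input length moves when the diagram's parameters are customized.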
People also ask
How does VideoMAE tokenize and mask video frames for self-supervised pretraining?
VideoMAE extracts 16-frame clips sampled every 4th frame, tokenizes them into 1,568 spatio-temporal tokens using 16×16 spatial patches and 2-frame temporal tubelets, then randomly masks 90% of tokens. Only the 157 visible tokens are encoded by a 12-layer transformer; the decoder reconstructs the 1,411 masked tokens, forcing the model to learn global video context without labels.
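The sampling and masking steps in this answer can be sketched with the standard library alone. This is an illustrative sketch under stated assumptions: `sample_frame_indices` and `random_mask` are hypothetical helper names, and the mask is drawn uniformly at random over all tokens, as the text describes, which may differ from the masking strategy a given VideoMAE implementation uses.

```python
import random


def sample_frame_indices(start, num_frames=16, stride=4):
    """Indices of the source frames that form one clip (every 4th frame)."""
    return [start + i * stride for i in range(num_frames)]


def random_mask(num_tokens=1568, mask_ratio=0.9, seed=0):
    """Split token indices into (visible, masked) lists, uniformly at random."""
    rng = random.Random(seed)                  # seeded for reproducibility
    num_masked = int(mask_ratio * num_tokens)  # 1411 for the defaults
    order = list(range(num_tokens))
    rng.shuffle(order)
    masked = sorted(order[:num_masked])
    visible = sorted(order[num_masked:])       # only these reach the encoder
    return visible, masked


frames = sample_frame_indices(start=0)
visible, masked = random_mask()
print(frames[0], frames[-1])       # 0 60
print(len(visible), len(masked))   # 157 1411
```

The encoder would receive only the tokens at the `visible` indices, and the reconstruction loss would be computed at the `masked` indices.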
- Domain: ML Pipeline
- Audience: Machine learning engineers implementing self-supervised video pretraining with masked reconstruction
Generated by Diagrams.so — AI architecture diagram generator with native Draw.io output. Fork this diagram, remix it, or download as .drawio, PNG, or SVG.