About This Architecture
This VideoMAE pretraining pipeline performs masked spatio-temporal reconstruction on 16-frame clips sampled at 4-frame intervals from 4K 60fps source video. Raw frames are tokenized into 1,568 spatio-temporal tokens using 16×16 spatial patches and 2-frame temporal tubelets; 90% of the tokens are then randomly masked, leaving only 157 visible tokens for the transformer encoder to process. The encoder outputs 768-dimensional latent embeddings that feed a decoder tasked with reconstructing the masked tokens, so the model learns rich video representations without labeled data. The aggressive 90% masking ratio forces the encoder to capture global spatio-temporal context rather than memorizing local pixel patterns.

Fork this diagram on Diagrams.so to customize frame rates, patch sizes, masking ratios, or model dimensions for your own video foundation model experiments.
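The token arithmetic above can be sketched in a few lines. This is an illustrative sketch, not the official VideoMAE code; it assumes a 224×224 spatial resolution, which is the only resolution consistent with 1,568 tokens given 16×16 patches and 2-frame tubelets:

```python
import numpy as np

# Assumed clip/tokenizer settings from the diagram (224x224 is an assumption).
FRAMES = 16          # frames per sampled clip
TUBELET = 2          # temporal tubelet depth in frames
PATCH = 16           # spatial patch size in pixels
H = W = 224          # assumed input resolution
MASK_RATIO = 0.90    # fraction of tokens hidden from the encoder

tokens_per_frame = (H // PATCH) * (W // PATCH)       # 14 * 14 = 196
temporal_positions = FRAMES // TUBELET               # 16 / 2 = 8
num_tokens = tokens_per_frame * temporal_positions   # 196 * 8 = 1,568

num_masked = int(round(num_tokens * MASK_RATIO))     # 1,411 masked tokens
num_visible = num_tokens - num_masked                # 157 visible tokens

# Random masking: shuffle token indices and split into visible/masked sets.
rng = np.random.default_rng(seed=0)
perm = rng.permutation(num_tokens)
visible_idx = np.sort(perm[:num_visible])   # indices the encoder processes
masked_idx = np.sort(perm[num_visible:])    # indices the decoder reconstructs
```

Only `visible_idx` tokens are embedded and passed through the encoder; the decoder receives the 768-dimensional encoder outputs plus learnable mask tokens at `masked_idx` positions and regresses the original pixel content there.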