VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training

Authors: Zhan Tong, Yibing Song, Jue Wang, Limin Wang

NeurIPS 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Notably, our VideoMAE with the vanilla ViT backbone can achieve 87.4% on Kinetics-400, 75.4% on Something-Something V2, 91.3% on UCF101, and 62.6% on HMDB51, without using any extra data.
Researcher Affiliation | Collaboration | Zhan Tong 1,2, Yibing Song 2, Jue Wang 2, Limin Wang 1,3; 1 State Key Laboratory for Novel Software Technology, Nanjing University; 2 Tencent AI Lab; 3 Shanghai AI Lab
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | Yes | Code is available at https://github.com/MCG-NJU/VideoMAE.
Open Datasets | Yes | We evaluate our VideoMAE on five common video datasets: Kinetics-400 [33], Something-Something V2 [25], UCF101 [60], HMDB51 [34], and AVA [26].
Dataset Splits | Yes | Kinetics-400 contains around 240k training videos and 20k validation videos of 10s from 400 classes. The Something-Something V2 is another large-scale video dataset, having around 169k videos for training and 20k videos for validation. ... UCF101 and HMDB51 are two relatively small video datasets, which contain around 9.5k/3.5k train/val videos and 3.5k/1.5k train/val videos, respectively. ... AVA, a dataset for spatiotemporal localization of human actions with 211k training and 57k validation video segments.
Hardware Specification | Yes | The wall-clock time of pre-training is benchmarked on 64 Tesla V100 GPUs with PyTorch.
Software Dependencies | No | The paper mentions 'PyTorch' but does not provide specific version numbers for any software dependencies.
Experiment Setup | Yes | For fine-tuning, we perform TSN [75] uniform sampling on SSV2 and dense sampling [77, 22] on K400. All models share the same inference protocol, i.e., 2 clips × 3 crops on SSV2 and 5 clips × 3 crops on K400. ... We take 4 blocks for the decoder by default. ... We find that VideoMAE is in favor of extremely high masking ratios (e.g. 90% to 95%). ... We find that the MSE loss could achieve a higher result compared with the L1 loss and smooth L1 loss. Therefore, we employ the MSE loss by default. (See the sketch after this table.)
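The Experiment Setup row reports the two ingredients that matter most for reproducing pre-training: an extremely high masking ratio (90% to 95%) applied as tube masking, and an MSE reconstruction loss. The sketch below illustrates that combination in PyTorch, assuming a generic token layout (temporal cubes × spatial patches); the grid sizes, target dimension, and function names are illustrative assumptions and are not taken from the official MCG-NJU/VideoMAE code.

```python
import torch


def tube_mask(batch, n_time, n_space, mask_ratio=0.9, device="cpu"):
    """Tube masking: sample one random spatial mask per video and repeat it
    along time, so the same patch locations are hidden in every temporal cube."""
    n_keep = int(n_space * (1.0 - mask_ratio))
    noise = torch.rand(batch, n_space, device=device)      # (B, S)
    ids_shuffle = noise.argsort(dim=1)                      # random permutation per sample
    keep = torch.zeros(batch, n_space, device=device)
    keep.scatter_(1, ids_shuffle[:, :n_keep], 1.0)          # mark the kept positions
    visible = keep.bool()                                   # (B, S), True = visible
    # Repeat the spatial pattern along the temporal axis -> (B, T * S)
    return visible.unsqueeze(1).expand(batch, n_time, n_space).reshape(batch, -1)


def masked_mse_loss(pred, target, visible):
    """MSE reconstruction loss averaged only over the masked patches."""
    masked = ~visible                                       # (B, N), True = reconstruct
    per_patch = (pred - target).pow(2).mean(dim=-1)         # (B, N)
    return (per_patch * masked).sum() / masked.sum().clamp(min=1)


if __name__ == "__main__":
    # Hypothetical sizes: 8 temporal cubes, a 14x14 spatial patch grid, 1536-dim pixel targets.
    B, T, S, D = 2, 8, 14 * 14, 1536
    visible = tube_mask(B, T, S, mask_ratio=0.9)
    pred = torch.randn(B, T * S, D)      # stand-in for decoder output
    target = torch.randn(B, T * S, D)    # stand-in for normalized pixel targets
    print("loss:", masked_mse_loss(pred, target, visible).item())
```

For the fine-tuned evaluation protocol quoted in the same row, "2 clips × 3 crops" means each video is scored as the average of six views (two temporally sampled clips, each with three spatial crops); "5 clips × 3 crops" on K400 averages fifteen views analogously.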