VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training

Authors: Zhan Tong, Yibing Song, Jue Wang, Limin Wang

NeurIPS 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Notably, our VideoMAE with the vanilla ViT backbone can achieve 87.4% on Kinetics-400, 75.4% on Something-Something V2, 91.3% on UCF101, and 62.6% on HMDB51, without using any extra data.
Researcher Affiliation | Collaboration | Zhan Tong 1,2, Yibing Song 2, Jue Wang 2, Limin Wang 1,3; 1 State Key Laboratory for Novel Software Technology, Nanjing University; 2 Tencent AI Lab; 3 Shanghai AI Lab
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | Yes | Code is available at https://github.com/MCG-NJU/VideoMAE.
Open Datasets | Yes | We evaluate our VideoMAE on five common video datasets: Kinetics-400 [33], Something-Something V2 [25], UCF101 [60], HMDB51 [34], and AVA [26].
Dataset Splits | Yes | Kinetics-400 contains around 240k training videos and 20k validation videos of 10s from 400 classes. The Something-Something V2 is another large-scale video dataset, having around 169k videos for training and 20k videos for validation. ... UCF101 and HMDB51 are two relatively small video datasets, which contain around 9.5k/3.5k train/val videos and 3.5k/1.5k train/val videos, respectively. ... AVA, a dataset for spatiotemporal localization of human actions with 211k training and 57k validation video segments.
Hardware Specification | Yes | The wall-clock time of pre-training is benchmarked on 64 Tesla V100 GPUs with PyTorch.
Software Dependencies | No | The paper mentions 'PyTorch' but does not provide specific version numbers for any software dependencies.
Experiment Setup | Yes | For fine-tuning, we perform TSN [75] uniform sampling on SSV2 and dense sampling [77, 22] on K400. All models share the same inference protocol, i.e., 2 clips × 3 crops on SSV2 and 5 clips × 3 crops on K400. ... We take 4 blocks for the decoder by default. ... We find that VideoMAE is in favor of extremely high masking ratios (e.g. 90% to 95%). ... We find that the MSE loss could achieve a higher result compared with the L1 loss and smooth L1 loss. Therefore, we employ the MSE loss by default. (See the sketch after this table.)
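The Experiment Setup row reports the two ingredients that matter most for reproducing pre-training: an extremely high masking ratio (90% to 95%) applied as tube masking, and an MSE reconstruction loss. The sketch below illustrates that combination in PyTorch, assuming a generic token layout (temporal cubes × spatial patches); the grid sizes, target dimension, and function names are illustrative assumptions and are not taken from the official MCG-NJU/VideoMAE code.

```python
import torch


def tube_mask(batch, n_time, n_space, mask_ratio=0.9, device="cpu"):
    """Tube masking: sample one random spatial mask per video and repeat it
    along time, so the same patch locations are hidden in every temporal cube."""
    n_keep = int(n_space * (1.0 - mask_ratio))
    noise = torch.rand(batch, n_space, device=device)      # (B, S)
    ids_shuffle = noise.argsort(dim=1)                      # random permutation per sample
    keep = torch.zeros(batch, n_space, device=device)
    keep.scatter_(1, ids_shuffle[:, :n_keep], 1.0)          # mark the kept positions
    visible = keep.bool()                                   # (B, S), True = visible
    # Repeat the spatial pattern along the temporal axis -> (B, T * S)
    return visible.unsqueeze(1).expand(batch, n_time, n_space).reshape(batch, -1)


def masked_mse_loss(pred, target, visible):
    """MSE reconstruction loss averaged only over the masked patches."""
    masked = ~visible                                       # (B, N), True = reconstruct
    per_patch = (pred - target).pow(2).mean(dim=-1)         # (B, N)
    return (per_patch * masked).sum() / masked.sum().clamp(min=1)


if __name__ == "__main__":
    # Hypothetical sizes: 8 temporal cubes, a 14x14 spatial patch grid, 1536-dim pixel targets.
    B, T, S, D = 2, 8, 14 * 14, 1536
    visible = tube_mask(B, T, S, mask_ratio=0.9)
    pred = torch.randn(B, T * S, D)      # stand-in for decoder output
    target = torch.randn(B, T * S, D)    # stand-in for normalized pixel targets
    print("loss:", masked_mse_loss(pred, target, visible).item())
```

For the fine-tuned evaluation protocol quoted in the same row, "2 clips × 3 crops" means each video is scored as the average of six views (two temporally sampled clips, each with three spatial crops); "5 clips × 3 crops" on K400 averages fifteen views analogously.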