VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training
Authors: Zhan Tong, Yibing Song, Jue Wang, Limin Wang
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Notably, our VideoMAE with the vanilla ViT backbone can achieve 87.4% on Kinetics-400, 75.4% on Something-Something V2, 91.3% on UCF101, and 62.6% on HMDB51, without using any extra data. |
| Researcher Affiliation | Collaboration | Zhan Tong (1,2), Yibing Song (2), Jue Wang (2), Limin Wang (1,3); (1) State Key Laboratory for Novel Software Technology, Nanjing University; (2) Tencent AI Lab; (3) Shanghai AI Lab |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code is available at https://github.com/MCG-NJU/VideoMAE. |
| Open Datasets | Yes | We evaluate our VideoMAE on five common video datasets: Kinetics-400 [33], Something-Something V2 [25], UCF101 [60], HMDB51 [34], and AVA [26]. |
| Dataset Splits | Yes | Kinetics-400 contains around 240k training videos and 20k validation videos of 10s from 400 classes. The Something-Something V2 is another large-scale video dataset, having around 169k videos for training and 20k videos for validation. ... UCF101 and HMDB51 are two relatively small video datasets, which contain around 9.5k/3.5k train/val videos and 3.5k/1.5k train/val videos, respectively. ... AVA, a dataset for spatiotemporal localization of human actions with 211k training and 57k validation video segments. |
| Hardware Specification | Yes | The wall-clock time of pre-training is benchmarked on 64 Tesla V100 GPUs with PyTorch. |
| Software Dependencies | No | The paper mentions 'PyTorch' but does not provide specific version numbers for any software dependencies. |
| Experiment Setup | Yes | For fine-tuning, we perform TSN [75] uniform sampling on SSV2 and dense sampling [77, 22] on K400. All models share the same inference protocol, i.e., 2 clips × 3 crops on SSV2 and 5 clips × 3 crops on K400. ... We take 4 blocks for the decoder by default. ... We find that VideoMAE favors extremely high masking ratios (e.g., 90% to 95%). ... We find that the MSE loss could achieve a higher result compared with the L1 loss and smooth L1 loss. Therefore, we employ the MSE loss by default. (A minimal sketch of this masking-and-loss setup follows the table.) |
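The experiment-setup evidence above centers on two ingredients of the pre-training recipe: an extremely high masking ratio (roughly 90% to 95%) and an MSE reconstruction loss. Below is a minimal PyTorch sketch of that objective, assuming a hypothetical `encoder`/`decoder` pair and a flattened `(B, N, D)` patch layout; it illustrates tube masking plus masked-patch MSE and is not the authors' implementation.

```python
# Minimal sketch (not the authors' code): tube masking at a ~90% ratio and an
# MSE loss computed only on masked patches, as described in the setup above.
# `encoder`, `decoder`, and the (B, N, D) patch layout are illustrative assumptions.
import torch
import torch.nn.functional as F

def tube_mask(patches_per_frame, num_frames, mask_ratio=0.90, device="cpu"):
    """Sample one random spatial mask and repeat it over time ("tube" masking)."""
    num_masked = int(patches_per_frame * mask_ratio)
    ids = torch.rand(patches_per_frame, device=device).argsort()
    spatial_mask = torch.zeros(patches_per_frame, dtype=torch.bool, device=device)
    spatial_mask[ids[:num_masked]] = True       # True = masked patch
    return spatial_mask.repeat(num_frames)      # same spatial mask on every frame

def pretrain_step(encoder, decoder, video_patches, mask):
    """video_patches: (B, N, D) flattened pixel patches; mask: (N,) bool."""
    visible = video_patches[:, ~mask]           # encoder only sees visible patches
    latent = encoder(visible)                   # assumed ViT-style encoder
    pred = decoder(latent, mask)                # assumed decoder returns all N patches
    loss = F.mse_loss(pred[:, mask], video_patches[:, mask])  # MSE on masked patches only
    return loss
```

Because 90% to 95% of the patches are dropped before the encoder, the visible token sequence is short, which is where most of the pre-training efficiency reported in the paper comes from.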