Masked Autoencoders As Spatiotemporal Learners
Authors: Christoph Feichtenhofer, Haoqi Fan, Yanghao Li, Kaiming He
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We report competitive results on several challenging video datasets using vanilla Vision Transformers [18]. We report strong results on a variety of video recognition datasets. In Sec. 5.1 and Sec. 5.2 we perform ablation experiments on Kinetics-400 (K400) [35]. We report top-1 classification accuracy (%) on the K400 validation set. |
| Researcher Affiliation | Industry | Christoph Feichtenhofer Haoqi Fan Yanghao Li Kaiming He Meta AI, FAIR |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Meta AI, FAIR https://github.com/facebookresearch/mae_st |
| Open Datasets | Yes | Kinetics-400 (K400) [35], ImageNet-1K (IN1K) [14], AVA [29], and Something-Something v2 (SSv2) [27]. |
| Dataset Splits | Yes | We report top-1 classification accuracy (%) on the K400 validation set. The 16 frames are sampled from the raw video with a temporal stride of 4 (i.e., 16×4 sampling in the literature [21]), and the starting frame is randomly sampled. In the spatial domain, we perform random resized cropping [63] with a scale range of [0.5, 1], and random horizontal flipping. Our inference process follows the common practice of multi-view testing [74, 21]: it takes K temporal clips (by default K=7 on Kinetics) to cover the video length, and for each clip it takes 3 spatial views to cover the longer spatial axis (denoted as K×3). A sketch of this multi-view protocol is given after the table. |
| Hardware Specification | Yes | Here the x-axis is the wall-clock training time (128 A100 GPUs), and the y-axis is the 1-view accuracy on Kinetics-400 validation. The speedup is closer to 5.8× if using slower GPUs (V100 instead of A100) that can hide the loading time. |
| Software Dependencies | No | The paper mentions using the AdamW optimizer but does not provide specific version numbers for any software dependencies or libraries. |
| Experiment Setup | Yes | Our MAE pre-training configuration mostly follows [31]. We use the AdamW optimizer [43] with a batch size of 512. Our default input size is 16 frames each with 224×224 pixels (i.e., 16×224×224). The 16 frames are sampled from the raw video with a temporal stride of 4 (i.e., 16×4 sampling in the literature [21]), and the starting frame is randomly sampled. In the spatial domain, we perform random resized cropping [63] with a scale range of [0.5, 1], and random horizontal flipping. We use a temporal patch size of 2 [2, 19, 77] and a spatial patch size of 16×16 [18], denoted as 2×16×16. The pre-training length is 800 epochs. A patchification sketch of this setup is also given after the table. |
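
The multi-view testing quoted in the Dataset Splits row (K temporal clips, 3 spatial views each) can be sketched as follows. This is a minimal illustration, assuming the decoded video is a NumPy array of shape (T, H, W, C) whose shorter spatial side has already been resized to the crop size, and a hypothetical `model` callable that returns class probabilities for one 16-frame clip; the function names and defaults are illustrative, not taken from the paper's released code.

```python
import numpy as np

def temporal_clip_starts(num_frames, clip_len=16, stride=4, num_clips=7):
    """Evenly spaced start indices so K clips cover the whole video length."""
    span = clip_len * stride
    max_start = max(num_frames - span, 0)
    return np.linspace(0, max_start, num_clips).astype(int)

def spatial_crops(clip, crop_size=224):
    """Three square crops (start/center/end) covering the longer spatial axis."""
    _, h, w, _ = clip.shape
    if h >= w:  # longer axis is height; shorter side assumed equal to crop_size
        offsets = np.linspace(0, h - crop_size, 3).astype(int)
        return [clip[:, o:o + crop_size, :crop_size] for o in offsets]
    offsets = np.linspace(0, w - crop_size, 3).astype(int)
    return [clip[:, :crop_size, o:o + crop_size] for o in offsets]

def multi_view_predict(model, video, clip_len=16, stride=4, num_clips=7):
    """Average scores over K temporal clips x 3 spatial views (K x 3 testing)."""
    scores = []
    for start in temporal_clip_starts(len(video), clip_len, stride, num_clips):
        frame_idx = np.clip(start + stride * np.arange(clip_len), 0, len(video) - 1)
        for crop in spatial_crops(video[frame_idx]):
            scores.append(model(crop))  # hypothetical model: clip -> class probs
    return np.mean(scores, axis=0)
```

With the defaults this averages 7 × 3 = 21 views per video, matching the K×3 notation in the quote.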
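
The Experiment Setup row specifies 16×224×224 clips cut into 2×16×16 spacetime patches. Below is a minimal sketch of that patchification plus MAE-style random masking of patches; the `mask_ratio` value, the function names, and the omission of the AdamW / batch-512 / 800-epoch training loop are simplifying assumptions, not details quoted above.

```python
import torch

def patchify(clip, t_patch=2, p=16):
    """(B, C, T, H, W) -> (B, N, t_patch*p*p*C) spacetime patches of size 2x16x16."""
    B, C, T, H, W = clip.shape
    x = clip.reshape(B, C, T // t_patch, t_patch, H // p, p, W // p, p)
    x = x.permute(0, 2, 4, 6, 3, 5, 7, 1)          # (B, T', H', W', t, p, p, C)
    return x.reshape(B, -1, t_patch * p * p * C)   # N = T' * H' * W'

def random_masking(patches, mask_ratio=0.9):
    """Keep a random subset of patches per sample; mask_ratio is illustrative."""
    B, N, D = patches.shape
    n_keep = int(N * (1 - mask_ratio))
    keep = torch.rand(B, N).argsort(dim=1)[:, :n_keep]  # indices of visible patches
    return torch.gather(patches, 1, keep.unsqueeze(-1).expand(-1, -1, D))

clip = torch.randn(2, 3, 16, 224, 224)   # (batch, RGB, 16 frames, 224, 224)
visible = random_masking(patchify(clip))
print(visible.shape)                     # torch.Size([2, 156, 1536])
```

A 16×224×224 clip yields 8 × 14 × 14 = 1568 patches of dimension 2·16·16·3 = 1536; only the visible subset would be fed to the MAE encoder.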