Masked Autoencoders As Spatiotemporal Learners

Authors: Christoph Feichtenhofer, Haoqi Fan, Yanghao Li, Kaiming He

NeurIPS 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We report competitive results on several challenging video datasets using vanilla Vision Transformers [18]. We report strong results on a variety of video recognition datasets. In Sec. 5.1 and Sec. 5.2 we perform ablation experiments on Kinetics-400 (K400) [35]. We report top-1 classification accuracy (%) on the K400 validation set.
Researcher Affiliation | Industry | Christoph Feichtenhofer, Haoqi Fan, Yanghao Li, Kaiming He (Meta AI, FAIR)
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | Yes | Meta AI, FAIR; https://github.com/facebookresearch/mae_st
Open Datasets | Yes | Kinetics-400 (K400) [35], ImageNet-1K (IN1K) [14], AVA [29], and Something-Something V2 (SSv2) [27].
Dataset Splits | Yes | We report top-1 classification accuracy (%) on the K400 validation set. The 16 frames are sampled from the raw video with a temporal stride of 4 (i.e., 16×4 sampling in the literature [21]), and the starting frame is randomly sampled. In the spatial domain, we perform random resized cropping [63] with a scale range of [0.5, 1], and random horizontal flipping. Our inference process follows the common practice of multi-view testing [74, 21]: it takes K temporal clips (by default K=7 on Kinetics) to cover the video length, and for each clip it takes 3 spatial views to cover the longer spatial axis (denoted as K×3). (An illustrative sketch of this K×3 view aggregation appears below the table.)
Hardware Specification | Yes | Here the x-axis is the wall-clock training time (128 A100 GPUs), and the y-axis is the 1-view accuracy on Kinetics-400 validation. The speedup is closer to 5.8 if using slower GPUs (V100 instead of A100) that can hide the loading time.
Software Dependencies | No | The paper mentions using the AdamW optimizer but does not provide specific version numbers for any software dependencies or libraries.
Experiment Setup | Yes | Our MAE pre-training configuration mostly follows [31]. We use the AdamW optimizer [43] with a batch size of 512. Our default input size is 16 frames each with 224×224 pixels (i.e., 16×224×224). The 16 frames are sampled from the raw video with a temporal stride of 4 (i.e., 16×4 sampling in the literature [21]), and the starting frame is randomly sampled. In the spatial domain, we perform random resized cropping [63] with a scale range of [0.5, 1], and random horizontal flipping. We use a temporal patch size of 2 [2, 19, 77] and a spatial patch size of 16×16 [18], denoted as 2×16×16. The pre-training length is 800 epochs. (A sketch of this spacetime patchification appears below the table.)
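
The setup quoted in the Experiment Setup row implies a fixed spacetime patchification: a 16×224×224 clip split into 2×16×16 patches yields 8×14×14 = 1,568 tokens, each of dimension 2·16·16·3 = 1,536. The following is a minimal illustrative sketch of that partitioning; PyTorch, the function name, and the tensor layout are assumptions, not taken from the authors' code.

    import torch

    def patchify_video(clip, t_patch=2, s_patch=16):
        """Split a clip of shape (C, T, H, W) into flattened spacetime patches."""
        C, T, H, W = clip.shape
        assert T % t_patch == 0 and H % s_patch == 0 and W % s_patch == 0
        # Group voxels into non-overlapping t_patch x s_patch x s_patch blocks.
        x = clip.reshape(C, T // t_patch, t_patch, H // s_patch, s_patch, W // s_patch, s_patch)
        x = x.permute(1, 3, 5, 2, 4, 6, 0)  # (T', H', W', pt, ph, pw, C)
        return x.reshape(-1, t_patch * s_patch * s_patch * C)

    clip = torch.randn(3, 16, 224, 224)     # 16 RGB frames of 224x224
    tokens = patchify_video(clip)
    print(tokens.shape)                     # torch.Size([1568, 1536])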
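
For the K×3 multi-view testing quoted in the Dataset Splits row, per-view class scores are typically averaged into a single prediction per video; the averaging rule and the stand-in classifier below are assumptions for illustration, not the paper's implementation.

    import torch

    @torch.no_grad()
    def multi_view_predict(model, views):
        """views: (K*3, C, T, H, W) holding K temporal clips x 3 spatial crops."""
        probs = torch.softmax(model(views), dim=-1)  # per-view class probabilities
        return probs.mean(dim=0)                     # average over all K*3 views

    # Hypothetical usage: random stand-in logits for the 400 Kinetics classes.
    model = lambda x: torch.randn(x.shape[0], 400)
    views = torch.randn(7 * 3, 3, 16, 224, 224)      # K=7 clips x 3 crops, 16x224x224 each
    video_probs = multi_view_predict(model, views)   # shape (400,)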