Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
Masked Autoencoders As Spatiotemporal Learners
Authors: Christoph Feichtenhofer, Haoqi Fan, Yanghao Li, Kaiming He
NeurIPS 2022 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We report competitive results on several challenging video datasets using vanilla Vision Transformers [18]. We report strong results on a variety of video recognition datasets. In Sec. 5.1 and Sec. 5.2 we perform ablation experiments on Kinetics-400 (K400) [35]. We report top-1 classification accuracy (%) on the K400 validation set. |
| Researcher Affiliation | Industry | Christoph Feichtenhofer Haoqi Fan Yanghao Li Kaiming He Meta AI, FAIR |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Meta AI, FAIR https://github.com/facebookresearch/mae_st |
| Open Datasets | Yes | Kinetics-400 (K400) [35], ImageNet-1K (IN1K) [14], AVA [29], and Something-Something v2 (SSv2) [27]. |
| Dataset Splits | Yes | We report top-1 classification accuracy (%) on the K400 validation set. The 16 frames are sampled from the raw video with a temporal stride of 4 (i.e., 16×4 sampling in the literature [21]), and the starting frame is randomly sampled. In the spatial domain, we perform random resized cropping [63] with a scale range of [0.5, 1], and random horizontal flipping. Our inference process follows the common practice of multi-view testing [74, 21]: it takes K temporal clips (by default K=7 on Kinetics) to cover the video length, and for each clip it takes 3 spatial views to cover the longer spatial axis (denoted as K×3). |
| Hardware Specification | Yes | Here the x-axis is the wall-clock training time (128 A100 GPUs), and the y-axis is the 1-view accuracy on Kinetics-400 validation. The speedup is closer to 5.8× if using slower GPUs (V100 instead of A100) that can hide the loading time. |
| Software Dependencies | No | The paper mentions using the AdamW optimizer but does not provide specific version numbers for any software dependencies or libraries. |
| Experiment Setup | Yes | Our MAE pre-training configuration mostly follows [31]. We use the AdamW optimizer [43] with a batch size of 512. Our default input size is 16 frames each with 224×224 pixels (i.e., 16×224×224). The 16 frames are sampled from the raw video with a temporal stride of 4 (i.e., 16×4 sampling in the literature [21]), and the starting frame is randomly sampled. In the spatial domain, we perform random resized cropping [63] with a scale range of [0.5, 1], and random horizontal flipping. We use a temporal patch size of 2 [2, 19, 77] and a spatial patch size of 16×16 [18], denoted as 2×16×16. The pre-training length is 800 epochs. |
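
To make the numbers quoted in the Dataset Splits and Experiment Setup rows concrete, the following is a minimal sketch of the arithmetic they imply: which frames a 16-frame, stride-4 clip covers, how many 2×16×16 patches a 16×224×224 input yields, and how many views K×3 testing produces. The constant and function names are hypothetical and chosen for illustration only; this is not code from the paper or from the linked repository, and it assumes non-overlapping patches and a video at least as long as the sampled span.

```python
import random

# Values quoted in the table above; the names are illustrative assumptions.
NUM_FRAMES = 16        # frames per clip
TEMPORAL_STRIDE = 4    # gap between sampled frames
FRAME_SIZE = 224       # height and width in pixels
TEMPORAL_PATCH = 2     # patch size along time
SPATIAL_PATCH = 16     # patch size along height and width
TEMPORAL_CLIPS = 7     # K on Kinetics at test time
SPATIAL_VIEWS = 3      # spatial views per temporal clip


def sample_frame_indices(video_length, rng=random):
    """Hypothetical helper: pick 16 frame indices with stride 4 and a random start."""
    # Number of raw frames spanned by one clip, first to last sampled frame.
    span = (NUM_FRAMES - 1) * TEMPORAL_STRIDE + 1
    if video_length < span:
        raise ValueError(f"video needs at least {span} frames, got {video_length}")
    start = rng.randrange(video_length - span + 1)
    return [start + i * TEMPORAL_STRIDE for i in range(NUM_FRAMES)]


def patch_grid():
    """Grid of non-overlapping patches as (time, height, width)."""
    return (
        NUM_FRAMES // TEMPORAL_PATCH,
        FRAME_SIZE // SPATIAL_PATCH,
        FRAME_SIZE // SPATIAL_PATCH,
    )


def num_patches():
    """Total number of patches per clip, before any masking."""
    t, h, w = patch_grid()
    return t * h * w


def num_test_views():
    """Views per video under K x 3 multi-view testing."""
    return TEMPORAL_CLIPS * SPATIAL_VIEWS


if __name__ == "__main__":
    print("frame indices:", sample_frame_indices(300))
    print("patch grid (T, H, W):", patch_grid())
    print("patches per clip:", num_patches())
    print("test views per video:", num_test_views())
```

Under these assumptions a clip spans 61 raw frames, the patch grid is 8×14×14 (1568 patches per clip), and the default Kinetics setting evaluates 21 views per video.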