Extending Video Masked Autoencoders to 128 frames

Authors: Nitesh Bharadwaj Gundavarapu, Luke Friedman, Raghav Goyal, Chaitra Hegde, Eirikur Agustsson, Sagar Waghmare, Mikhail Sirotenko, Ming-Hsuan Yang, Tobias Weyand, Boqing Gong, Leonid Sigal

NeurIPS 2024

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | "We validate our design choices through exhaustive ablations and observe improved performance of the resulting long-video (128 frames) encoders over short-video (32 frames) counterparts. With our long-video masked autoencoder (LVMAE) strategy, we surpass state-of-the-art on Diving48 by 3.9 points and EPIC-Kitchens-100 verb classification by 2.5 points while relying on a simple core architecture and video-only pre-training (unlike some of the prior works that require millions of labeled video-text pairs or specialized encoders)." |
| Researcher Affiliation | Collaboration | 1Google Research, 2University of British Columbia. {ngundavarapu,lbfried,cvhegde,eirikur}@google.com, {rgoyal14,lsigal}@cs.ubc.ca |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | No | "We will strive to make the code open-source if the paper is accepted but at this point, we cannot share any code." |
| Open Datasets | Yes | "We experiment with LVMAE on conventional video action classification benchmarks that have potential to benefit from long-video encoders: EPIC-Kitchens-100 (EK100) [18] and Diving48 (D48) [19]." |
| Dataset Splits | No | The paper uses well-known datasets (EPIC-Kitchens-100, Diving48) but does not explicitly state training/validation/test split percentages, sample counts, or citations for the specific splits used. |
| Hardware Specification | Yes | "We implement the code in Scenic [63] and run our pre-training experiments on 128 TPUv5e chips and fine-tuning experiments on 64 TPUv5e chips." |
| Software Dependencies | No | The paper states "We implement the code in Scenic [63]" but does not provide version numbers for Scenic or other key software dependencies. |
| Experiment Setup | Yes | Table 11 provides a "Model Card with detailed model architecture and training setups for Base size experiments on Diving48 and EPIC-Kitchens", including hyperparameters such as optimizer (Adam), learning rate (1.5e-4 for pre-training, 0.5 for fine-tuning), epochs (1600 for pre-training, 50/200 for fine-tuning), batch size (1024\|512 for pre-training, 64 for fine-tuning), weight decay, and augmentation strategies. |
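The hyperparameters reported from the paper's Table 11 can be collected into a configuration sketch. This is a minimal illustration only: the field and variable names below are assumptions (the paper's actual Scenic configs are not public), and only the numeric values come from the report above.

```python
# Hedged sketch of the Table 11 training setups for Base-size experiments.
# Field names are illustrative; values are those quoted in the report.
pretrain_config = {
    "optimizer": "adam",
    "learning_rate": 1.5e-4,
    "epochs": 1600,
    "batch_size": (1024, 512),  # reported as "1024|512" (per benchmark)
}

finetune_config = {
    "optimizer": "adam",
    "learning_rate": 0.5,
    "epochs": (50, 200),  # reported as "50/200" (Diving48 / EPIC-Kitchens)
    "batch_size": 64,
}

def summarize(name: str, cfg: dict) -> str:
    """Render one setup as a single human-readable line."""
    fields = ", ".join(f"{k}={v}" for k, v in cfg.items())
    return f"{name}: {fields}"

print(summarize("pretrain", pretrain_config))
print(summarize("finetune", finetune_config))
```

Note that weight decay and augmentation strategies are also listed in Table 11 but without values quoted here, so they are omitted from the sketch.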