Extending Video Masked Autoencoders to 128 frames

Authors: Nitesh Bharadwaj Gundavarapu, Luke Friedman, Raghav Goyal, Chaitra Hegde, Eirikur Agustsson, Sagar Waghmare, Mikhail Sirotenko, Ming-Hsuan Yang, Tobias Weyand, Boqing Gong, Leonid Sigal

NeurIPS 2024

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | "We validate our design choices through exhaustive ablations and observe improved performance of the resulting long-video (128 frames) encoders over short-video (32 frames) counterparts. With our long-video masked autoencoder (LVMAE) strategy, we surpass state-of-the-art on Diving48 by 3.9 points and EPIC-Kitchens-100 verb classification by 2.5 points while relying on a simple core architecture and video-only pre-training (unlike some of the prior works that require millions of labeled video-text pairs or specialized encoders)." |
| Researcher Affiliation | Collaboration | 1Google Research, 2University of British Columbia. {ngundavarapu,lbfried,cvhegde,eirikur}@google.com, {rgoyal14,lsigal}@cs.ubc.ca |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | No | "We will strive to make the code open-source if the paper is accepted but at this point, we cannot share any code." |
| Open Datasets | Yes | "We experiment with LVMAE on conventional video action classification benchmarks that have potential to benefit from long-video encoders: EPIC-Kitchens-100 (EK100) [18] and Diving48 (D48) [19]." |
| Dataset Splits | No | The paper uses well-known datasets (EPIC-Kitchens-100, Diving48) but does not explicitly state training/validation/test split percentages, sample counts, or citations for the specific splits used. |
| Hardware Specification | Yes | "We implement the code in Scenic [63] and run our pre-training experiments on 128 TPUv5e chips and fine-tuning experiments on 64 TPUv5e chips." |
| Software Dependencies | No | The paper states "We implement the code in Scenic [63]" but does not provide version numbers for Scenic or other key software dependencies. |
| Experiment Setup | Yes | Table 11 provides a "Model Card with detailed model architecture and training setups for Base size experiments on Diving48 and EPIC-Kitchens", including hyperparameters such as optimizer (Adam), learning rate (1.5e-4 for pre-training, 0.5 for fine-tuning), epochs (1600 for pre-training, 50/200 for fine-tuning), batch size (1024\|512 for pre-training, 64 for fine-tuning), weight decay, and augmentation strategies. |
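The hyperparameters reported from the paper's Table 11 can be collected into a configuration sketch. This is a minimal illustration only: the field and variable names below are assumptions (the paper's actual Scenic configs are not public), and only the numeric values come from the report above.

```python
# Hedged sketch of the Table 11 training setups for Base-size experiments.
# Field names are illustrative; values are those quoted in the report.
pretrain_config = {
    "optimizer": "adam",
    "learning_rate": 1.5e-4,
    "epochs": 1600,
    "batch_size": (1024, 512),  # reported as "1024|512" (per benchmark)
}

finetune_config = {
    "optimizer": "adam",
    "learning_rate": 0.5,
    "epochs": (50, 200),  # reported as "50/200" (Diving48 / EPIC-Kitchens)
    "batch_size": 64,
}

def summarize(name: str, cfg: dict) -> str:
    """Render one setup as a single human-readable line."""
    fields = ", ".join(f"{k}={v}" for k, v in cfg.items())
    return f"{name}: {fields}"

print(summarize("pretrain", pretrain_config))
print(summarize("finetune", finetune_config))
```

Note that weight decay and augmentation strategies are also listed in Table 11 but without values quoted here, so they are omitted from the sketch.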