Extending Video Masked Autoencoders to 128 frames
Authors: Nitesh Bharadwaj Gundavarapu, Luke Friedman, Raghav Goyal, Chaitra Hegde, Eirikur Agustsson, Sagar Waghmare, Mikhail Sirotenko, Ming-Hsuan Yang, Tobias Weyand, Boqing Gong, Leonid Sigal
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We validate our design choices through exhaustive ablations and observe improved performance of the resulting long-video (128 frames) encoders over short-video (32 frames) counterparts. With our long-video masked autoencoder (LVMAE) strategy, we surpass state-of-the-art on Diving48 by 3.9 points and EPIC-Kitchens-100 verb classification by 2.5 points while relying on a simple core architecture and video-only pre-training (unlike some of the prior works that require millions of labeled video-text pairs or specialized encoders). |
| Researcher Affiliation | Collaboration | 1Google Research 2University of British Columbia {ngundavarapu,lbfried,cvhegde,eirikur}@google.com {rgoyal14,lsigal}@cs.ubc.ca |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | No | We will strive to make the code open-source if the paper is accepted but at this point, we cannot share any code. |
| Open Datasets | Yes | We experiment with LVMAE on conventional video action classification benchmarks that have potential to benefit from long-video encoders: EPIC-Kitchens-100 (EK100) [18] and Diving48 (D48) [19]. |
| Dataset Splits | No | The paper mentions using well-known datasets like EPIC-Kitchens-100 and Diving48, but it does not explicitly provide specific training, validation, and test split percentages, sample counts, or direct citations for the splits used. |
| Hardware Specification | Yes | We implement the code in Scenic [63] and run our pre-training experiments on 128 TPUv5e chips and fine-tuning experiments on 64 TPUv5e chips. |
| Software Dependencies | No | The paper states 'We implement the code in Scenic [63]' but does not provide specific version numbers for Scenic or other key software dependencies. |
| Experiment Setup | Yes | Table 11 provides a 'Model Card with detailed model architecture and training setups for Base size experiments on Diving48 and EPIC-Kitchens', including hyperparameters such as optimizer (Adam), learning rate (1.5e-4 for pre-training, 0.5 for fine-tuning), epochs (1600 for pre-training, 50/200 for fine-tuning), batch size (1024|512 for pre-training, 64 for fine-tuning), weight decay, and augmentation strategies. |
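The hyperparameters reported in Table 11 can be collected into a minimal configuration sketch. This is a hypothetical summary for illustration only, not the authors' actual Scenic configuration; all key names and the dict structure are assumptions, and ambiguous reported values (e.g. "1024|512", "50/200") are kept unresolved rather than mapped to specific datasets:

```python
# Hypothetical summary of the Base-size training setups reported in Table 11.
# Key names are illustrative assumptions; only the numeric values come from
# the paper's model card.
PRETRAIN = {
    "optimizer": "adam",          # reported optimizer
    "learning_rate": 1.5e-4,      # reported pre-training learning rate
    "epochs": 1600,               # reported pre-training epochs
    "batch_sizes": (1024, 512),   # reported as "1024|512" (setup-dependent)
}

FINETUNE = {
    "optimizer": "adam",
    "learning_rate": 0.5,         # reported fine-tuning learning rate
    "epochs": (50, 200),          # reported as "50/200" (setup-dependent)
    "batch_size": 64,
    "weight_decay": True,         # weight decay and augmentation are used,
    "augmentation": True,         # but exact values are in Table 11 itself
}
```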