Egocentric Video-Language Pretraining

Authors: Kevin Qinghong Lin, Jinpeng Wang, Mattia Soldan, Michael Wray, Rui Yan, Eric Z. Xu, Difei Gao, Rong-Cheng Tu, Wenzhe Zhao, Weijie Kong, Chengfei Cai, Hongfa Wang, Dima Damen, Bernard Ghanem, Wei Liu, Mike Zheng Shou

NeurIPS 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Furthermore, we demonstrate strong performance on five egocentric downstream tasks across three datasets: video-text retrieval on EPIC-KITCHENS-100; action recognition on Charades-Ego; natural language query, moment query, and object state change classification on Ego4D challenge benchmarks. We conduct extensive experiments to demonstrate the superiority of Egocentric VLP by transferring our pretrained representation to five egocentric downstream benchmarks and achieving state-of-the-art performance.
Researcher Affiliation | Collaboration | 1 Show Lab, National University of Singapore; 2 University of Bristol; 3 King Abdullah University of Science and Technology; 4 Tencent Data Platform
Pseudocode | No | The paper describes its proposed methods and formulations mathematically (e.g., Eq. 1, Eq. 2, the EgoNCE formulation) but does not include any explicit pseudocode or algorithm blocks (an illustrative EgoNCE-style sketch is given after the table).
Open Source Code | Yes | The dataset and code are available at https://github.com/showlab/EgoVLP.
Open Datasets | Yes | The paper uses well-known, publicly available datasets such as Ego4D, EPIC-KITCHENS-100, and Charades-Ego, citing them appropriately. For example: 'We exploit the recently released Ego4D dataset to pioneer Egocentric VLP along three directions.' and '[15] Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Girdhar, Jackson Hamburger, Hao Jiang, Miao Liu, Xingyu Liu, et al. Ego4d: Around the world in 3,000 hours of egocentric video. In CVPR, pages 18995-19012, 2022.'
Dataset Splits | Yes | The paper specifies exact counts for training and validation sets across multiple tasks: 'The training set contains 67.2K clips and validation set contains 9.7K clips.' (EPIC-KITCHENS-100), 'The training set contains 11.3K queries annotated from 1K clips for this task, while the validation contains 3.9K queries collected from 0.3K clips.' (Natural Language Query), 'The validation set contains 847 videos for classification' (Charades-Ego), 'The training set contains 13.6K instances from 1.5K clips, while the validation set contains 4.3K instances from 0.5K clips.' (Moment Query), and 'The training and val. sets contain 41K and 28K clips, respectively.' (OSCC).
Hardware Specification | Yes | Pretraining takes two days on 32 A100 GPUs (1,536 GPU-hours; the arithmetic is spelled out after the table).
Software Dependencies | No | The paper mentions using specific software components like 'Frozen [3]', 'TimeSformer [32]', 'DistilBERT [33]', and 'Adam optimizer [41]' but does not provide specific version numbers for these software libraries or tools.
Experiment Setup | Yes | During pretraining, we sample 4 frames for each clip, and use the Adam optimizer [41] with a learning rate of 3e-5. To select the best method we pretrain our architecture for 10 epochs and use the best performing model on the EgoMCQ benchmark. (A hyperparameter sketch follows the table.)
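
The EgoNCE objective mentioned in the Pseudocode row extends InfoNCE with action-aware positives (narrations sharing a noun and a verb) and scene-aware hard negatives (adjacent clips from the same video). Below is a minimal PyTorch sketch of such a loss; the function name egonce_style_loss, the pos_mask convention, and the optional scene_neg_emb argument are illustrative assumptions, not the authors' reference implementation.

import torch
import torch.nn.functional as F

def egonce_style_loss(video_emb, text_emb, pos_mask, scene_neg_emb=None, tau=0.05):
    # video_emb, text_emb: [N, D] L2-normalised clip / narration embeddings.
    # pos_mask: [N, N] float, pos_mask[i, j] = 1 if narration j counts as an
    #   action-aware positive for clip i; the diagonal (each clip's own
    #   narration) is always 1.
    # scene_neg_emb: optional [N, D] embeddings of an adjacent clip from the
    #   same video, used as scene-aware hard negatives on the video side.
    sim = video_emb @ text_emb.t() / tau          # [N, N] similarity logits
    exp_sim = sim.exp()
    v2t = -((exp_sim * pos_mask).sum(1) / exp_sim.sum(1)).log().mean()

    sim_t = text_emb @ video_emb.t() / tau        # text-to-video direction
    exp_sim_t = sim_t.exp()
    denom_t = exp_sim_t.sum(1)
    if scene_neg_emb is not None:
        # hard negatives only enlarge the denominator; they are never positives
        denom_t = denom_t + (text_emb @ scene_neg_emb.t() / tau).exp().sum(1)
    t2v = -((exp_sim_t * pos_mask.t()).sum(1) / denom_t).log().mean()
    return 0.5 * (v2t + t2v)

# Toy usage: 8 clip/narration pairs, 256-d embeddings, diagonal-only positives.
N, D = 8, 256
v = F.normalize(torch.randn(N, D), dim=-1)
t = F.normalize(torch.randn(N, D), dim=-1)
mask = torch.eye(N)
print(egonce_style_loss(v, t, mask).item())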
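For the Hardware Specification row, the parenthetical GPU-hour figure is simply the product of GPU count and wall-clock time:

$32 \text{ GPUs} \times 2 \text{ days} \times 24 \,\text{h/day} = 1{,}536 \text{ GPU-hours}$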
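The Experiment Setup row reports only a handful of hyperparameters. The sketch below wires them into PyTorch's Adam optimizer; only the frame count, learning rate, and epoch count come from the paper, while the placeholder model and loss are hypothetical stand-ins.

import torch

NUM_FRAMES = 4        # frames sampled per clip during pretraining (reported)
LEARNING_RATE = 3e-5  # Adam learning rate (reported)
NUM_EPOCHS = 10       # pretraining epochs; best checkpoint picked on EgoMCQ (reported)

model = torch.nn.Linear(768, 256)  # stand-in for the dual video-text encoder
optimizer = torch.optim.Adam(model.parameters(), lr=LEARNING_RATE)

for epoch in range(NUM_EPOCHS):
    # In the real pipeline: sample NUM_FRAMES frames per clip, embed video and
    # text, and compute the contrastive loss (e.g. egonce_style_loss above).
    dummy_batch = torch.randn(8, 768)
    loss = model(dummy_batch).pow(2).mean()  # placeholder loss for illustration
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()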