Learning State-Aware Visual Representations from Audible Interactions

Authors: Himangi Mittal, Pedro Morgado, Unnat Jain, Abhinav Gupta

NeurIPS 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We validate these contributions extensively on two large-scale egocentric datasets, EPIC-Kitchens-100 and the recently released Ego4D, and show improvements on several downstream tasks, including action recognition, long-term action anticipation, and object state change classification.
Researcher Affiliation | Collaboration | Himangi Mittal¹, Pedro Morgado¹, Unnat Jain², Abhinav Gupta¹ (¹Carnegie Mellon University, ²Meta AI Research)
Pseudocode | No | The paper does not include explicitly labeled 'Pseudocode' or 'Algorithm' blocks with structured steps formatted like code.
Open Source Code | Yes | Code and pretrained model are available here: https://github.com/HimangiM/RepLAI
Open Datasets | Yes | We evaluate on two egocentric datasets: EPIC-Kitchens-100 [14] and Ego4D [27].
Dataset Splits | Yes | Video action recognition (AR) on EPIC-Kitchens-100 and Ego4D. Given a short video clip, the task is to classify the verb and noun of the action taking place. This is done using two separate linear classifiers trained for this task. We report the top-1 and top-5 accuracies, following [14] (Tab. 1) and [27] (Tab. 2).
Hardware Specification | Yes | Models are trained with stochastic gradient descent for 100 epochs with a batch size of 128 over 4 GTX 1080 Ti GPUs, a learning rate of 0.005, and a momentum of 0.9. For Ego4D, we use a batch size of 512 over 8 RTX 2080 Ti GPUs with a learning rate of 0.05.
Software Dependencies | No | The paper mentions software components such as the 'R(2+1)D video encoder' and a '2D CNN' but does not specify version numbers for the programming languages, libraries, or frameworks used (e.g., Python, PyTorch, TensorFlow).
Experiment Setup | Yes | Models are trained with stochastic gradient descent for 100 epochs with a batch size of 128 over 4 GTX 1080 Ti GPUs, a learning rate of 0.005, and a momentum of 0.9. For Ego4D, we use a batch size of 512 over 8 RTX 2080 Ti GPUs with a learning rate of 0.05. The two loss terms in Eq. 7 are equally weighted with α = 0.5.
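The evaluation protocol quoted above reports top-1 and top-5 accuracies from linear classifiers. As a minimal illustrative sketch (not the authors' code, using hypothetical toy scores), top-k accuracy over per-class classifier scores can be computed as:

```python
def top_k_accuracy(scores, labels, k):
    """Fraction of samples whose true label is among the k highest-scoring classes."""
    correct = 0
    for per_class, label in zip(scores, labels):
        # indices of the k largest scores for this sample
        top_k = sorted(range(len(per_class)), key=lambda i: per_class[i], reverse=True)[:k]
        correct += label in top_k
    return correct / len(labels)

# Toy 3-class example (hypothetical scores, not from the paper)
scores = [[0.1, 0.7, 0.2], [0.5, 0.3, 0.2], [0.2, 0.2, 0.6]]
labels = [1, 2, 0]
top1 = top_k_accuracy(scores, labels, 1)  # only the first sample is correct at k=1
top2 = top_k_accuracy(scores, labels, 2)
```

In the papers' benchmarks the same computation is applied separately to the verb and noun classifier outputs.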
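The quoted setup trains with SGD (learning rate 0.005, momentum 0.9 on EPIC-Kitchens-100) and weights the two loss terms of Eq. 7 equally with α = 0.5. A minimal sketch of that configuration follows; `loss_a` and `loss_b` are placeholder names for the two terms, and the update rule is one common momentum formulation, not the authors' released implementation:

```python
def combined_loss(loss_a, loss_b, alpha=0.5):
    # The two loss terms of Eq. 7, equally weighted with alpha = 0.5
    # (loss_a / loss_b are placeholder names for those terms).
    return alpha * loss_a + (1 - alpha) * loss_b

def sgd_momentum_step(params, grads, velocity, lr=0.005, momentum=0.9):
    # Momentum update: v <- momentum * v + g;  p <- p - lr * v
    new_params, new_velocity = [], []
    for p, g, v in zip(params, grads, velocity):
        v = momentum * v + g
        new_params.append(p - lr * v)
        new_velocity.append(v)
    return new_params, new_velocity

# One toy step with the quoted EPIC-Kitchens hyperparameters (lr=0.005, momentum=0.9)
params, vel = sgd_momentum_step([1.0], [2.0], [0.0])
```

For Ego4D, the same sketch would use `lr=0.05`; batch size and GPU count affect only how gradients are accumulated, not the update rule itself.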