Learning State-Aware Visual Representations from Audible Interactions
Authors: Himangi Mittal, Pedro Morgado, Unnat Jain, Abhinav Gupta
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We validate these contributions extensively on two large-scale egocentric datasets, EPIC-Kitchens-100 and the recently released Ego4D, and show improvements on several downstream tasks, including action recognition, long-term action anticipation, and object state change classification. |
| Researcher Affiliation | Collaboration | Himangi Mittal¹, Pedro Morgado¹, Unnat Jain², Abhinav Gupta¹; ¹Carnegie Mellon University, ²Meta AI Research |
| Pseudocode | No | The paper does not include explicitly labeled 'Pseudocode' or 'Algorithm' blocks with structured, code-like steps. |
| Open Source Code | Yes | Code and pretrained model are available here: https://github.com/HimangiM/RepLAI |
| Open Datasets | Yes | We evaluate on two egocentric datasets: EPIC-Kitchens-100 [14] and Ego4D [27]. |
| Dataset Splits | Yes | Video action recognition (AR) on EPIC-Kitchens-100 and Ego4D. Given a short video clip, the task is to classify the verb and noun of the action taking place. This is done using two separate linear classifiers trained for this task. We report the top-1 and top-5 accuracies, following [14] (Tab. 1) and [27] (Tab. 2). (A linear-probe sketch of this evaluation follows the table.) |
| Hardware Specification | Yes | Models are trained with stochastic gradient descent for 100 epochs with a batch size of 128 trained over 4 GTX 1080 Ti GPUs, a learning rate of 0.005 and a momentum of 0.9. For Ego4D, we use a batch size of 512 trained over 8 RTX 2080 Ti GPUs with a learning rate of 0.05. |
| Software Dependencies | No | The paper mentions software components like 'R(2+1)D video encoder' and '2D CNN' but does not specify any version numbers for programming languages, libraries, or frameworks used (e.g., Python, PyTorch, TensorFlow). |
| Experiment Setup | Yes | Models are trained with stochastic gradient descent for 100 epochs with a batch size of 128 trained over 4 GTX 1080 Ti GPUs, a learning rate of 0.005 and a momentum of 0.9. For Ego4D, we use a batch size of 512 trained over 8 RTX 2080 Ti GPUs with a learning rate of 0.05. The two loss terms in Eq. 7 are equally weighted with α = 0.5. (A training-loop sketch using these settings follows the table.) |
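
As noted in the Dataset Splits row, action recognition is evaluated by training two separate linear classifiers (one for verbs, one for nouns) on top of the pretrained features and reporting top-1/top-5 accuracy. Below is a minimal PyTorch sketch of that linear-probe setup; the feature dimension is an assumption, and the class counts are the standard EPIC-Kitchens-100 values (97 verbs, 300 nouns), not numbers quoted in this report.

```python
import torch
import torch.nn as nn

# Hypothetical sizes for illustration; this report does not state them.
FEAT_DIM = 512                   # assumed output dim of the frozen encoder
NUM_VERBS, NUM_NOUNS = 97, 300   # standard EPIC-Kitchens-100 class counts

# Two separate linear classifiers on top of frozen features, as described.
verb_head = nn.Linear(FEAT_DIM, NUM_VERBS)
noun_head = nn.Linear(FEAT_DIM, NUM_NOUNS)

@torch.no_grad()
def topk_accuracy(logits: torch.Tensor, labels: torch.Tensor, ks=(1, 5)):
    """Top-k accuracy, matching the top-1/top-5 metrics reported."""
    _, pred = logits.topk(max(ks), dim=1)   # (B, max_k) predicted classes
    hits = pred.eq(labels.unsqueeze(1))     # (B, max_k) boolean matches
    return {k: hits[:, :k].any(dim=1).float().mean().item() for k in ks}

# Example: score one dummy batch of frozen features against verb labels.
feats = torch.randn(8, FEAT_DIM)
verb_labels = torch.randint(0, NUM_VERBS, (8,))
print(topk_accuracy(verb_head(feats), verb_labels))
```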
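
The Experiment Setup row states the optimization hyper-parameters directly. The sketch below wires them into a standard PyTorch SGD loop; the model and both loss terms are placeholders (the real R(2+1)D encoder and objectives live in the linked repository), and the exact form of the α-weighted combination from Eq. 7 is an assumption.

```python
import torch
import torch.nn as nn

# Stand-ins: the real R(2+1)D video encoder, audio network, and loss terms
# are in the authors' repo (https://github.com/HimangiM/RepLAI).
model = nn.Linear(512, 128)     # hypothetical placeholder module
feats = torch.randn(128, 512)   # one dummy batch (batch size 128, per the paper)

# SGD settings quoted for EPIC-Kitchens-100: lr 0.005, momentum 0.9,
# 100 epochs, batch size 128 (Ego4D instead uses batch 512, lr 0.05).
optimizer = torch.optim.SGD(model.parameters(), lr=0.005, momentum=0.9)

ALPHA = 0.5  # Eq. 7: the two loss terms are equally weighted.
for epoch in range(100):
    z = model(feats)
    loss_a = z.pow(2).mean()   # placeholder for the first loss term
    loss_b = z.abs().mean()    # placeholder for the second loss term
    # One plausible reading of "equally weighted with alpha = 0.5":
    loss = ALPHA * loss_a + (1 - ALPHA) * loss_b
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```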