Opening the Vocabulary of Egocentric Actions
Authors: Dibyadip Chatterjee, Fadime Sener, Shugao Ma, Angela Yao
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | 5 experiments. We conduct experiments on three datasets: EPIC100 [8] features 100 hours of egocentric footage of daily kitchen activities annotated with 97 verbs and 300 interacting objects. Assembly101 [51] is a multi-view procedural activity dataset featuring 4321 videos of people assembling and disassembling 101 take-apart toy vehicles and is annotated with 24 verbs and 90 objects. |
| Researcher Affiliation | Collaboration | Dibyadip Chatterjee¹, Fadime Sener², Shugao Ma², Angela Yao¹ (¹National University of Singapore, ²Meta Reality Labs Research) |
| Pseudocode | No | The paper describes its methods using text and diagrams but does not contain any structured pseudocode or clearly labeled algorithm blocks. |
| Open Source Code | No | The paper provides a project webpage URL (https://dibschat.github.io/openvocab-ego) but does not explicitly state that the source code for the described methodology is available at this link, nor is it a direct link to a code repository. |
| Open Datasets | Yes | We conduct experiments on three datasets: EPIC100 [8] features 100 hours of egocentric footage of daily kitchen activities annotated with 97 verbs and 300 interacting objects. Assembly101 [51] is a multi-view procedural activity dataset featuring 4321 videos of people assembling and disassembling 101 take-apart toy vehicles and is annotated with 24 verbs and 90 objects. Something-Else [35] is a human-object interaction subset of SSv2 [16] with 174 verb phrases, e.g., 'picking something up'. |
| Dataset Splits | No | The paper provides train and test segment counts in Table 1 (e.g., EPIC100-OV: 63k train seg, 13.8k test seg), and describes how objects are split into base (training) and novel (testing) categories. However, it does not explicitly provide details for a separate validation set split (e.g., percentages or absolute counts for a validation partition). |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU types, or memory used for running the experiments. It only mentions model architectures like 'ViT-B/16' and 'ViT-L/14'. |
| Software Dependencies | No | The paper mentions using specific models and detectors like S3D, 100DOH, and CLIP, but does not specify any software versions for frameworks (e.g., PyTorch, TensorFlow) or libraries used in the implementation. |
| Experiment Setup | Yes | For all datasets, the verb encoder is pretrained with the OAP contrastive scheme for 300 epochs and then fine-tuned for verb classification for another 100 epochs. Keeping both the CLIP image and text encoders frozen, prompts are learned using AOP for 20 epochs. Unless otherwise stated, we choose k1 and k2, the number of learnable prefix and postfix prompts, to be 16 each. A hedged sketch of this prompt setup follows the table. |
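
To make the prompt-learning recipe quoted above concrete, here is a minimal PyTorch sketch, assuming a frozen text transformer and k1 = k2 = 16 learnable prefix/postfix prompt vectors wrapped around the class-name token embeddings. This is an illustrative approximation, not the authors' code: `PromptedTextEncoder`, the mean-pooled readout, and the toy transformer stand-in for CLIP's text encoder are all assumptions.

```python
import torch
import torch.nn as nn

class PromptedTextEncoder(nn.Module):
    """Frozen text transformer with k1 learnable prefix and k2 learnable
    postfix prompt vectors concatenated around class-name token embeddings.
    (A simplified CoOp-style sketch; the paper's exact design may differ.)"""
    def __init__(self, text_encoder, embed_dim=512, k1=16, k2=16):
        super().__init__()
        self.encoder = text_encoder
        for p in self.encoder.parameters():
            p.requires_grad = False  # CLIP encoders stay frozen; only prompts train
        self.prefix = nn.Parameter(0.02 * torch.randn(k1, embed_dim))
        self.postfix = nn.Parameter(0.02 * torch.randn(k2, embed_dim))

    def forward(self, class_embeds):
        # class_embeds: (num_classes, n_tokens, embed_dim) name-token embeddings
        n = class_embeds.size(0)
        seq = torch.cat([
            self.prefix.expand(n, -1, -1),   # learnable prefix prompts
            class_embeds,                    # frozen class-name tokens
            self.postfix.expand(n, -1, -1),  # learnable postfix prompts
        ], dim=1)
        out = self.encoder(seq)              # contextualized token features
        return out.mean(dim=1)               # one text feature per class (assumed pooling)

# Toy stand-in for CLIP's text transformer, just to make the sketch runnable.
toy_encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True),
    num_layers=2,
)
model = PromptedTextEncoder(toy_encoder, embed_dim=512, k1=16, k2=16)
feats = model(torch.randn(10, 4, 512))       # 10 classes, 4 name tokens each
print(feats.shape)                           # torch.Size([10, 512])
```

Only `prefix` and `postfix` receive gradients here, which mirrors the setup described in the table: both CLIP encoders are kept frozen while the 16 + 16 prompt vectors are learned.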