HENASY: Learning to Assemble Scene-Entities for Interpretable Egocentric Video-Language Model
Authors: Khoa Vo, Thinh Phan, Kashu Yamazaki, Minh Tran, Ngan Le
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our method demonstrates strong interpretability in both quantitative and qualitative experiments, while maintaining competitive performance on five downstream tasks via zero-shot transfer or as video/text representation, including video/text retrieval, action recognition, multi-choice query, natural language query, and moments query. |
| Researcher Affiliation | Academia | Khoa Vo, Thinh Phan, Kashu Yamazaki, Minh Tran, Ngan Le; AICV Lab, University of Arkansas, Fayetteville, USA; {khoavoho,thinhp,kyamazak,minht,thile}@uark.edu |
| Pseudocode | No | The paper describes methods and processes with equations but does not include any clearly labeled "Pseudocode" or "Algorithm" blocks. |
| Open Source Code | No | Project page: https://uark-aicv.github.io/HENASY. Furthermore, we will release our GitHub implementation soon. |
| Open Datasets | Yes | Publicly available massive-scale egocentric datasets such as Ego4D [1] and EPIC-KITCHENS-100 [2], providing suites of egocentric tasks, have further sparked even more interest within the research community. HENASY is trained on EgoClip [3], which contains 3.8M clip-narration pairs covering a subset of 2,927 video hours from Ego4D [1]. |
| Dataset Splits | No | The paper mentions specific datasets and evaluation protocols (e.g., zero-shot transfer) but does not provide explicit train/validation/test split percentages or sample counts for the datasets used for training or fine-tuning. |
| Hardware Specification | Yes | We train HENASY on two A6000 GPUs, for 5 epochs with AdamW optimizer [37] at a fixed learning rate of 3e-5, and with batch size of 128. |
| Software Dependencies | No | The paper does not provide specific version numbers for software dependencies such as Python, PyTorch, or other libraries, only mentioning "AdamW optimizer [37]" and "Llama-2 [34]". |
| Experiment Setup | Yes | Training. HENASY is trained on EgoClip [3], which contains 3.8M clip-narration pairs covering a subset of 2,927 video hours from Ego4D [1]. For each video clip, we uniformly sample 4 frames. We employ the pre-extracted narration's nouns and pre-detected hand and object bounding boxes from [4] for NEC loss and projection loss, respectively. For verb phrases, we employ Llama-2 [34] with a prompt as discussed in Appendix C. The loss weights in Eq. 10 are set as: λ1 = 0.5, λ2 = 0.5, λ3 = 1.0. We train HENASY on two A6000 GPUs, for 5 epochs with AdamW optimizer [37] at a fixed learning rate of 3e-5, and with batch size of 128. (A minimal configuration sketch follows this table.) |
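
The Experiment Setup and Hardware Specification rows pin down the reported hyperparameters (AdamW, fixed learning rate 3e-5, batch size 128, 5 epochs, 4 uniformly sampled frames per clip, loss weights λ1 = λ2 = 0.5, λ3 = 1.0), but the authors' code is not yet released. The following is a minimal PyTorch sketch of that training configuration only: the `PlaceholderVLM` module, the toy tensors, and the individual loss terms are hypothetical stand-ins, not the paper's architecture or objectives.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Toy stand-in for the HENASY model; the real architecture is not yet released.
class PlaceholderVLM(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.video_proj = nn.Linear(4 * 3 * 32 * 32, dim)  # 4 frames per clip, flattened
        self.text_proj = nn.Linear(64, dim)

    def forward(self, video, text):
        v = self.video_proj(video.flatten(1))
        t = self.text_proj(text)
        sim = nn.functional.cosine_similarity(v, t).mean()
        # Dummy loss terms standing in for the paper's individual objectives.
        return {"loss_a": 1 - sim, "loss_b": v.pow(2).mean(), "loss_c": t.pow(2).mean()}

# Toy tensors in place of EgoClip clip-narration pairs (4 frames per clip, as reported).
videos = torch.randn(256, 4, 3, 32, 32)
texts = torch.randn(256, 64)
loader = DataLoader(TensorDataset(videos, texts), batch_size=128, shuffle=True)

model = PlaceholderVLM()
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)  # fixed learning rate, as reported
lam1, lam2, lam3 = 0.5, 0.5, 1.0  # loss weights reported for Eq. 10

for epoch in range(5):  # the paper reports 5 epochs on two A6000 GPUs
    for video, text in loader:
        out = model(video, text)
        # Weighted sum in the spirit of Eq. 10; the mapping of weights to terms is illustrative.
        loss = lam1 * out["loss_a"] + lam2 * out["loss_b"] + lam3 * out["loss_c"]
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```

The sketch runs on a single device; matching the paper's two-GPU run would additionally require a distributed wrapper such as `torch.nn.parallel.DistributedDataParallel`.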