HENASY: Learning to Assemble Scene-Entities for Interpretable Egocentric Video-Language Model

Authors: Khoa Vo, Thinh Phan, Kashu Yamazaki, Minh Tran, Ngan Le

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our method demonstrates strong interpretability in both quantitative and qualitative experiments, while maintaining competitive performance on five downstream tasks via zero-shot transfer or as video/text representation, including video/text retrieval, action recognition, multi-choice query, natural language query, and moments query.
Researcher Affiliation | Academia | Khoa Vo, Thinh Phan, Kashu Yamazaki, Minh Tran, Ngan Le (AICV Lab, University of Arkansas, Fayetteville, USA); {khoavoho,thinhp,kyamazak,minht,thile}@uark.edu
Pseudocode | No | The paper describes its methods and processes with equations but does not include any clearly labeled "Pseudocode" or "Algorithm" blocks.
Open Source Code | No | Project page: https://uark-aicv.github.io/HENASY. Furthermore, we will release our GitHub implementation soon.
Open Datasets | Yes | Publicly available massive-scale egocentric datasets such as Ego4D [1] and Epic Kitchens-100 [2], providing suites of egocentric tasks, have further sparked even more interest within the research community. HENASY is trained on EgoClip [3], which contains 3.8M clip-narration pairs covering a subset of 2,927 video hours from Ego4D [1].
Dataset Splits | No | The paper mentions specific datasets and evaluation protocols (e.g., zero-shot transfer) but does not provide explicit train/validation/test split percentages or sample counts for the datasets used for training or fine-tuning.
Hardware Specification | Yes | We train HENASY on two A6000 GPUs, for 5 epochs with the AdamW optimizer [37] at a fixed learning rate of 3e-5, and with a batch size of 128.
Software Dependencies | No | The paper does not provide specific version numbers for software dependencies such as Python, PyTorch, or other libraries, only mentioning "AdamW optimizer [37]" and "Llama-2 [34]".
Experiment Setup | Yes | Training. HENASY is trained on EgoClip [3], which contains 3.8M clip-narration pairs covering a subset of 2,927 video hours from Ego4D [1]. For each video clip, we uniformly sample 4 frames. We employ the pre-extracted narration's nouns and pre-detected hand and object bounding boxes from [4] for the NEC loss and projection loss, respectively. For verb phrases, we employ Llama-2 [34] with a prompt as discussed in Appendix C. The loss weights in Eq. 10 are set as: λ1 = 0.5, λ2 = 0.5, λ3 = 1.0. We train HENASY on two A6000 GPUs, for 5 epochs with the AdamW optimizer [37] at a fixed learning rate of 3e-5, and with a batch size of 128.
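
To make the reported setup concrete, the following is a minimal PyTorch-style sketch of the training configuration, not the authors' implementation. The model, the dummy data, and the three loss terms are placeholders (the composition of Eq. 10 is not reproduced in this table); only the hyperparameters quoted above come from the paper: 4 uniformly sampled frames per clip, the AdamW optimizer at a fixed learning rate of 3e-5, batch size 128, 5 epochs, and loss weights λ1 = 0.5, λ2 = 0.5, λ3 = 1.0.

```python
import torch
from torch import nn
from torch.optim import AdamW
from torch.utils.data import DataLoader, TensorDataset

# Hyperparameters quoted in the experiment setup above.
NUM_FRAMES, BATCH_SIZE, EPOCHS, LR = 4, 128, 5, 3e-5
LAMBDA1, LAMBDA2, LAMBDA3 = 0.5, 0.5, 1.0  # loss weights reported for Eq. 10

# Placeholder model: the real HENASY architecture is not reproduced here.
model = nn.Sequential(nn.Flatten(), nn.Linear(NUM_FRAMES * 3 * 64 * 64, 256))
optimizer = AdamW(model.parameters(), lr=LR)

# Dummy clips standing in for EgoClip clip-narration pairs: each clip is
# NUM_FRAMES uniformly sampled RGB frames (here at a reduced 64x64 resolution).
clips = torch.randn(256, NUM_FRAMES, 3, 64, 64)
loader = DataLoader(TensorDataset(clips), batch_size=BATCH_SIZE, shuffle=True)

for epoch in range(EPOCHS):
    for (batch,) in loader:
        feats = model(batch)
        # Stand-in loss terms; in the paper these would be the video-language
        # objectives (e.g., the NEC and projection losses) combined as in Eq. 10.
        loss_1, loss_2, loss_3 = feats.pow(2).mean(), feats.abs().mean(), feats.mean().abs()
        loss = LAMBDA1 * loss_1 + LAMBDA2 * loss_2 + LAMBDA3 * loss_3
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```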