Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.
HENASY: Learning to Assemble Scene-Entities for Interpretable Egocentric Video-Language Model
Authors: Khoa Vo, Thinh Phan, Kashu Yamazaki, Minh Tran, Ngan Le
NeurIPS 2024 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our method demonstrates strong interpretability in both quantitative and qualitative experiments; while maintaining competitive performances on five downstream tasks via zero-shot transfer or as video/text representation, including video/text retrieval, action recognition, multi-choice query, natural language query, and moments query. |
| Researcher Affiliation | Academia | Khoa Vo, Thinh Phan, Kashu Yamazaki, Minh Tran, Ngan Le. AICV Lab, University of Arkansas, Fayetteville, USA |
| Pseudocode | No | The paper describes methods and processes with equations but does not include any clearly labeled "Pseudocode" or "Algorithm" blocks. |
| Open Source Code | No | Project page: https://uark-aicv.github.io/HENASY. Furthermore, we will release our GitHub implementation soon. |
| Open Datasets | Yes | Publicly available massive-scale egocentric datasets such as Ego4D [1] and Epic Kitchens-100 [2], providing suites of egocentric tasks, have further sparked even more interest within the research community. HENASY is trained on EgoClip [3], which contains 3.8M clip-narration pairs covering a subset of 2,927 video hours from Ego4D [1]. |
| Dataset Splits | No | The paper mentions specific datasets and evaluation protocols (e.g., zero-shot transfer) but does not provide explicit train/validation/test split percentages or sample counts for the datasets used for training or fine-tuning. |
| Hardware Specification | Yes | We train HENASY on two A6000 GPUs, in 5 epochs with AdamW optimizer [37] at fixed learning rate of 3e-5, and with batch size of 128. |
| Software Dependencies | No | The paper does not provide specific version numbers for software dependencies such as Python, PyTorch, or other libraries, only mentioning "AdamW optimizer [37]" and "Llama-2 [34]". |
| Experiment Setup | Yes | Training. HENASY is trained on EgoClip [3], which contains 3.8M clip-narration pairs covering a subset of 2,927 video hours from Ego4D [1]. For each video clip, we uniformly sample 4 frames. We employ the pre-extracted narration's nouns and pre-detected hand and object bounding boxes from [4] for NEC loss and projection loss, respectively. For verb phrases, we employ Llama-2 [34] with a prompt as discussed in Appendix C. The loss weights in Eq. 10 are set as: λ1 = 0.5, λ2 = 0.5, λ3 = 1.0. We train HENASY on two A6000 GPUs, in 5 epochs with AdamW optimizer [37] at fixed learning rate of 3e-5, and with batch size of 128. |
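
The training settings quoted in the Experiment Setup row can be collected into a small configuration object, which may help anyone attempting a reproduction. The sketch below is illustrative only and is not the authors' code: the names `TrainConfig` and `uniform_frame_indices` are hypothetical, and the segment-center rule is just one common reading of "uniformly sample 4 frames", since the quoted text does not say how the frames are chosen. Eq. 10 is not reproduced on this page, so the loss weights are stored without assuming which loss term each one scales.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class TrainConfig:
    """Hyperparameters as quoted in the Experiment Setup row (hypothetical container)."""
    num_frames: int = 4            # frames sampled per video clip
    epochs: int = 5
    batch_size: int = 128
    learning_rate: float = 3e-5    # fixed learning rate, no schedule mentioned
    optimizer: str = "AdamW"
    num_gpus: int = 2              # two A6000 GPUs
    # (lambda1, lambda2, lambda3) from Eq. 10; the mapping to loss terms is not assumed here.
    loss_weights: Tuple[float, float, float] = (0.5, 0.5, 1.0)


def uniform_frame_indices(total_frames: int, num_frames: int) -> List[int]:
    """Return one frame index per equal-length segment, taken at the segment center.

    Assumption: this is one plausible interpretation of uniform sampling,
    not necessarily the rule used in the paper.
    """
    if total_frames <= 0 or num_frames <= 0:
        raise ValueError("total_frames and num_frames must be positive")
    segment = total_frames / num_frames
    return [min(total_frames - 1, int(segment * (i + 0.5))) for i in range(num_frames)]


config = TrainConfig()
indices = uniform_frame_indices(total_frames=16, num_frames=config.num_frames)
```

For a 16-frame clip this picks indices 2, 6, 10, and 14. Details the report marks as missing, such as dataset splits and library versions, would still have to be filled in before an actual run.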