Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. [K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026].

Gaze-VLM: Bridging Gaze and VLMs through Attention Regularization for Egocentric Understanding

Authors: Anupam Pani, Yanchao Yang

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Experimental results show that our approach improves semantic prediction scores by up to 11% for future event prediction and around 7% for current activity understanding, compared to the corresponding baseline models trained without gaze regularization. These results highlight the value of gaze-guided training in improving the accuracy and robustness of egocentric VLMs. Overall, this work establishes a foundation for using human gaze to enhance the predictive capabilities of VLMs in real-world scenarios like assistive robots and human-machine collaboration. Code and additional information is available at: https://github.com/anupampani/Gaze-VLM
Researcher Affiliation	Academia	Anupam Pani (1), Yanchao Yang (1, 2). (1) HKU Musketeers Foundation Institute of Data Science, The University of Hong Kong; (2) Department of Electrical and Electronic Engineering, The University of Hong Kong. EMAIL, EMAIL
Pseudocode	Yes	Algorithm 1: Iterative Prompt Refinement for Fine-Grained Annotation. Input: Validation image subset V = {V1, ..., Vm}; Full image set I = {I1, ..., Ik}; Initial prompt P0; GPT-4V model; ChatGPT interface. Output: Final refined prompt P; Annotations {T1, ..., Tk}
Open Source Code	Yes	Code and additional information is available at: https://github.com/anupampani/Gaze-VLM
Open Datasets	Yes	We adapt the Ego4D dataset (Grauman et al., 2022) to construct a training set suitable for egocentric activity understanding and future prediction using gaze-regularized VLMs. Ego4D provides egocentric video clips with synchronized eye-tracking data, which we leverage through the following processing steps:
Dataset Splits	Yes	Our experiments utilize Ego4D clips with gaze annotations, which provide the necessary ground truth for both training and evaluation. The primary experimental comparison contrasts our gaze-regularized models with a base model that uses only RGB inputs and standard attention, without any gaze supervision or alignment during training.
Hardware Specification	Yes	Both the base model and the gaze-regularized model were trained using two NVIDIA A800 80GB GPU cards. For initial experiments, we used the OpenFlamingo architecture to develop and evaluate our approach. The base OpenFlamingo model required approximately 36–38 hours to train, while the corresponding gaze-regularized version took around 50 hours.
Software Dependencies	No	Data loading was managed using the WebDataset loader, with datasets converted to .tar format for compatibility with both WebDataset and FSDP. After validating our method with OpenFlamingo, we extended the same training pipeline to other architectures such as InternVL, LaViLa, and OpenLLaVA. In each case, the integration of our gaze-regularized component followed the same principle: it was inserted immediately after the visual encoder and before the language decoder, allowing for modular modulation of attention without disrupting the rest of the model architecture.
Experiment Setup	Yes	Training was conducted with a batch size of 32 and a learning rate of 7 × 10^-5 over 10 epochs. The vision encoders were kept frozen and pre-trained. To accelerate training, we employed Fully Sharded Data Parallel (FSDP), which efficiently distributes model parameters and gradients across GPUs, improving memory usage and speed.
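
The hardware and experiment-setup rows above can be collected into a small configuration record. The following is a minimal sketch, not the authors' code: the class, field, and function names are hypothetical, and it assumes the reported learning rate reads 7 × 10^-5. It also derives two figures implied by the quoted numbers: total GPU-hours for the gaze-regularized run, and its wall-clock overhead relative to the base model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReportedSetup:
    """Values quoted in the table above; all field names are hypothetical."""
    batch_size: int = 32
    learning_rate: float = 7e-5   # assumption: the reported rate is 7 x 10^-5
    epochs: int = 10
    num_gpus: int = 2             # two NVIDIA A800 80GB cards
    freeze_vision_encoder: bool = True
    use_fsdp: bool = True


def gpu_hours(wall_clock_hours: float, num_gpus: int) -> float:
    """Total GPU-hours, assuming every GPU is busy for the whole run."""
    return wall_clock_hours * num_gpus


def relative_overhead(base_hours: float, regularized_hours: float) -> float:
    """Fractional extra wall-clock time of the gaze-regularized run."""
    return (regularized_hours - base_hours) / base_hours


setup = ReportedSetup()

# Reported training times in hours: base model 36-38, gaze-regularized ~50.
base_low, base_high, regularized = 36.0, 38.0, 50.0

print("GPU-hours (gaze-regularized):", gpu_hours(regularized, setup.num_gpus))
print("Overhead vs. 38 h base:", round(relative_overhead(base_high, regularized), 3))
print("Overhead vs. 36 h base:", round(relative_overhead(base_low, regularized), 3))
```

Under these assumptions, the gaze-regularized run costs about 100 GPU-hours and takes roughly 32% to 39% longer than the base model.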