Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Gaze-VLM: Bridging Gaze and VLMs through Attention Regularization for Egocentric Understanding

Authors: Anupam Pani, Yanchao Yang

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results show that our approach improves semantic prediction scores by up to 11% for future event prediction and around 7% for current activity understanding, compared to the corresponding baseline models trained without gaze regularization. These results highlight the value of gaze-guided training in improving the accuracy and robustness of egocentric VLMs. Overall, this work establishes a foundation for using human gaze to enhance the predictive capabilities of VLMs in real-world scenarios like assistive robots and human-machine collaboration. Code and additional information is available at: https://github.com/anupampani/Gaze-VLM
Researcher Affiliation | Academia | Anupam Pani (1), Yanchao Yang (1, 2). (1) HKU Musketeers Foundation Institute of Data Science, The University of Hong Kong; (2) Department of Electrical and Electronic Engineering, The University of Hong Kong
Pseudocode | Yes | Algorithm 1: Iterative Prompt Refinement for Fine-Grained Annotation. Input: Validation image subset V = {V1, ..., Vm}; Full image set I = {I1, ..., Ik}; Initial prompt P0; GPT-4V model; ChatGPT interface. Output: Final refined prompt P; Annotations {T1, ..., Tk}
Open Source Code | Yes | Code and additional information is available at: https://github.com/anupampani/Gaze-VLM
Open Datasets | Yes | We adapt the Ego4D dataset (Grauman et al., 2022) to construct a training set suitable for egocentric activity understanding and future prediction using gaze-regularized VLMs. Ego4D provides egocentric video clips with synchronized eye-tracking data, which we leverage through the following processing steps:
Dataset Splits | Yes | Our experiments utilize Ego4D clips with gaze annotations, which provide the necessary ground truth for both training and evaluation. The primary experimental comparison contrasts our gaze-regularized models with a base model that uses only RGB inputs and standard attention, without any gaze supervision or alignment during training.
Hardware Specification | Yes | Both the base model and the gaze-regularized model were trained using two NVIDIA A800 80GB GPU cards. For initial experiments, we used the OpenFlamingo architecture to develop and evaluate our approach. The base OpenFlamingo model required approximately 36 to 38 hours to train, while the corresponding gaze-regularized version took around 50 hours.
Software Dependencies | No | Data loading was managed using the WebDataset loader, with datasets converted to .tar format for compatibility with both WebDataset and FSDP. After validating our method with OpenFlamingo, we extended the same training pipeline to other architectures such as InternVL, LaViLa, and OpenLLaVA. In each case, the integration of our gaze-regularized component followed the same principle: it was inserted immediately after the visual encoder and before the language decoder, allowing for modular modulation of attention without disrupting the rest of the model architecture.
Experiment Setup | Yes | Training was conducted with a batch size of 32 and a learning rate of 7 × 10^-5 over 10 epochs. The vision encoders were kept frozen and pre-trained. To accelerate training, we employed Fully Sharded Data Parallel (FSDP), which efficiently distributes model parameters and gradients across GPUs, improving memory usage and speed.
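
For readers who want the reported training settings in one place, the following is a minimal sketch that collects the figures quoted in the Hardware Specification and Experiment Setup rows and derives the implied compute cost. It is not taken from the authors' repository. The names `TrainConfig`, `gpu_hours`, and `relative_overhead` are hypothetical, the learning rate is read as 7e-5, and the batch size of 32 is assumed to be the global batch size across both GPUs (the excerpt does not say whether it is global or per device).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Hypothetical container for the settings quoted in the table above."""
    batch_size: int = 32                # ASSUMPTION: treated as the global batch size
    learning_rate: float = 7e-5         # ASSUMPTION: garbled value read as 7 x 10^-5
    epochs: int = 10
    num_gpus: int = 2                   # two NVIDIA A800 80GB cards
    freeze_vision_encoder: bool = True  # vision encoders kept frozen
    use_fsdp: bool = True               # Fully Sharded Data Parallel


def gpu_hours(wall_clock_hours: float, num_gpus: int) -> float:
    """Convert wall-clock training time into GPU-hours."""
    return wall_clock_hours * num_gpus


def relative_overhead(base_hours: float, regularized_hours: float) -> float:
    """Fractional extra training time of the gaze-regularized run."""
    return regularized_hours / base_hours - 1.0


cfg = TrainConfig()
base_low, base_high = 36.0, 38.0  # reported base OpenFlamingo training time (hours)
gaze_hours = 50.0                 # reported gaze-regularized training time (hours)

print(f"per-GPU batch (if 32 is global): {cfg.batch_size // cfg.num_gpus}")
print(f"base GPU-hours: {gpu_hours(base_low, cfg.num_gpus):.0f} to "
      f"{gpu_hours(base_high, cfg.num_gpus):.0f}")
print(f"gaze GPU-hours: {gpu_hours(gaze_hours, cfg.num_gpus):.0f}")
print(f"overhead: {relative_overhead(base_high, gaze_hours):.0%} to "
      f"{relative_overhead(base_low, gaze_hours):.0%}")
```

Under these assumptions, the reported timings correspond to roughly 72 to 76 GPU-hours for the base model and about 100 GPU-hours for the gaze-regularized model, an increase in training time of roughly one third.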