Voila-A: Aligning Vision-Language Models with User's Gaze Attention

Authors: Kun Yan, Zeyu Wang, Lei Ji, Yuntao Wang, Nan Duan, Shuai Ma

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate Voila-A using a hold-out validation set and a newly collected VOILA-GAZE test set, which features real-life scenarios captured with a gaze-tracking device. Our experimental results demonstrate that Voila-A significantly outperforms several baseline models.
Researcher Affiliation | Collaboration | Kun Yan¹, Zeyu Wang², Lei Ji³, Yuntao Wang², Nan Duan³, Shuai Ma¹ (¹SKLSDE Lab, Beihang University; ²Key Laboratory of Pervasive Computing, Tsinghua University; ³Microsoft Research)
Pseudocode | No | The paper does not contain any clearly labeled 'Pseudocode' or 'Algorithm' blocks, nor does it present structured steps in a pseudocode-like format.
Open Source Code | Yes | Our code is available at https://github.com/naykun/Voila-A
Open Datasets | Yes | Table 1: Statistics of Voila-COCO and Voila-Gaze Datasets (SR refers to Survival Rate from raw data after filtering). VOILA-COCO training: 20,000 images, 70,000 questions, 93.5% SR; VOILA-COCO validation: 100 images, 550 questions, 71.1% SR; VOILA-COCO test: 500 images, 1,900 questions, 75.7% SR.
Dataset Splits | Yes | Table 1: Statistics of Voila-COCO and Voila-Gaze Datasets (SR refers to Survival Rate from raw data after filtering). VOILA-COCO training: 20,000 images, 70,000 questions, 93.5% SR; VOILA-COCO validation: 100 images, 550 questions, 71.1% SR; VOILA-COCO test: 500 images, 1,900 questions, 75.7% SR.
Hardware Specification | No | The paper mentions training and optimizing models, but does not specify the GPU or CPU models, memory, or other hardware components used for the experiments. Section F, which is cited for compute resources in the NeurIPS checklist, only details model configurations and training parameters, not hardware.
Software Dependencies | Yes | The text model is an instance of MPTForCausalLM 7B and the vision model is based on the CLIP ViT-L/14 [42] vision encoder. ... The tokenizer used is EleutherAI/gpt-neox-20b. The model's torch data type is set to bfloat16. (A loading sketch follows the table.)
Experiment Setup | Yes | For optimization, we employ the AdamW optimizer [22] with a starting learning rate of 1e-5 and a batch size of 4. We train Voila for three epochs, scheduling the learning rate using a cosine annealing scheduler. To prevent exploding gradients, we apply gradient clipping with a threshold of 1.0. (A training-loop sketch follows the table.)
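
To make the Software Dependencies row concrete, here is a minimal loading sketch. It assumes the Hugging Face transformers library; the checkpoint names mosaicml/mpt-7b and openai/clip-vit-large-patch14 are illustrative stand-ins for the MPT-7B and CLIP ViT-L/14 components named above, not identifiers confirmed by the paper. Only the EleutherAI/gpt-neox-20b tokenizer and the bfloat16 dtype are stated directly in the quoted text.

```python
# Dependency sketch (assumption: Hugging Face transformers; the checkpoint
# names below are illustrative stand-ins, not taken from the paper).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, CLIPVisionModel

# Text backbone: an MPT-7B causal LM, loaded in bfloat16 as the paper specifies.
text_model = AutoModelForCausalLM.from_pretrained(
    "mosaicml/mpt-7b",            # hypothetical checkpoint name
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,       # MPT checkpoints ship custom modeling code
)

# Vision backbone: the CLIP ViT-L/14 vision encoder.
vision_model = CLIPVisionModel.from_pretrained("openai/clip-vit-large-patch14")

# Tokenizer: EleutherAI/gpt-neox-20b, as stated in the paper.
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")
```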
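The Experiment Setup row likewise maps onto a standard PyTorch training loop. The sketch below is runnable but uses a dummy model and dataset as placeholders; only the AdamW optimizer, 1e-5 learning rate, batch size 4, three epochs, cosine annealing schedule, and gradient-clipping threshold of 1.0 come from the quoted text.

```python
# Training-setup sketch reflecting the quoted hyperparameters.
# The model and data are dummy placeholders standing in for Voila-A
# and Voila-COCO, which are not reproduced here.
import torch
from torch import nn
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR
from torch.utils.data import DataLoader, TensorDataset

model = nn.Linear(16, 1)                                  # placeholder model
data = TensorDataset(torch.randn(64, 16), torch.randn(64, 1))
train_loader = DataLoader(data, batch_size=4)             # batch size 4

EPOCHS = 3                                                # three epochs
criterion = nn.MSELoss()
optimizer = AdamW(model.parameters(), lr=1e-5)            # starting lr 1e-5
# Cosine annealing over the total number of optimizer steps.
scheduler = CosineAnnealingLR(optimizer, T_max=EPOCHS * len(train_loader))

for epoch in range(EPOCHS):
    for x, y in train_loader:
        loss = criterion(model(x), y)
        optimizer.zero_grad()
        loss.backward()
        # Gradient clipping with a threshold of 1.0 to prevent exploding gradients.
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        optimizer.step()
        scheduler.step()
```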