Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Glance2Gaze: Efficient Vision-Language Models from Glance Fusion to Gaze Compression
Authors: Juan Chen, Honglin Liu, Yingying Ao, Ting Zhang, Yan Huang, Xudong Liu, Biao Li, Jintao Fang
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results on widely adopted benchmarks demonstrate that Glance2Gaze outperforms existing methods, achieving superior performance with equal or lower computational cost. Furthermore, it generalizes well to high-resolution and video scenarios, showcasing robust and scalable efficiency improvements in VLMs. We conduct extensive experiments to evaluate the effectiveness and efficiency of our proposed framework. Empirical results demonstrate that it consistently outperforms state-of-the-art baselines on both image and video understanding tasks, while maintaining comparable computational efficiency on the LLaVA backbone series. |
| Researcher Affiliation | Collaboration | 1School of Computer Science and Engineering, South China University of Technology 2Meituan Inc. 3School of Artificial Intelligence, Beijing Normal University |
| Pseudocode | No | The paper describes the methodology in detailed text and diagrams (e.g., Figure 2), but it does not contain any explicitly labeled 'Pseudocode' or 'Algorithm' blocks or sections. |
| Open Source Code | No | Our code will be released upon paper's acceptance. |
| Open Datasets | Yes | To validate the effectiveness of our method in image understanding tasks, we conducted experiments on ten mainstream datasets, including TextVQA [44], POPE [45], GQA [46], VQAv2 [47], SEEDBench [48], MMBench [49], MME [50], ScienceQA-IMG [51], MMVet [52] and LLaVA-Bench-in-the-wild [2]. Beyond image understanding, we applied the proposed method to video comprehension tasks, evaluating it on four widely adopted video-based question answering benchmarks: TGIF [54], MSVD [55], MSRVTT [55], and ActivityNet [56]. |
| Dataset Splits | Yes | For LLaVA, the vision encoder was frozen while the remaining parameters were fine-tuned using the LLaVA-665k [3] dataset, adhering to the original training settings. For LLaVA-NeXT, all parameters were unfrozen during fine-tuning. Given its proprietary code and training data, we used the Open-LLaVA-NeXT [53], an open-source replication, for training, following PDrop [21]. ... we employ a two-stage training strategy ... training only the projector on the LLaVA-558K dataset ... utilizing the same dataset as in [53]. |
| Hardware Specification | Yes | All experiments are conducted on 8 NVIDIA-A100-80G GPUs. |
| Software Dependencies | No | The paper mentions several models and frameworks like LLaVA, Vicuna, CLIP-ViT, Flash Attn, but it does not provide specific version numbers for any software libraries, programming languages, or environments used for implementation. |
| Experiment Setup | Yes | In the Glance Fusion module, L is set to {7, 13, 19, 23}. We implement the Gaze Compression strategy under different compression ratios. ... we employ a two-stage training strategy for Glance2Gaze. In the first stage, we align image-text pairs by retaining the LLaVA architecture and training only the projector on the LLaVA-558K dataset for one epoch with a batch size of 256 and a learning rate of 1e-3. In the second stage, we incorporate Glance Fusion and Gaze Compression into LLaVA, training all parameters except the visual encoder for one epoch with a batch size of 128 and a learning rate of 2e-5. ... For Gaze Compression, we adjust r and pr to manage different computation configurations, as detailed in Table 7 and 8. |
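
The two-stage schedule quoted in the Experiment Setup row can be written down as a small configuration sketch. This is only an illustration of the reported hyperparameters, not the authors' code (which has not been released): the names `StageConfig`, `FUSION_LAYERS`, `STAGES`, and `describe` are hypothetical, and the second-stage dataset is left unset because the quoted excerpt does not name it.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class StageConfig:
    """Hypothetical container for one training stage as quoted in the table."""
    name: str
    dataset: Optional[str]  # None when the quoted excerpt does not name it
    trainable: str          # which parameters are updated in this stage
    epochs: int
    batch_size: int
    learning_rate: float


# Layer set L reported for the Glance Fusion module.
FUSION_LAYERS: Tuple[int, ...] = (7, 13, 19, 23)

STAGES: Tuple[StageConfig, ...] = (
    # Stage 1: image-text alignment on the plain LLaVA architecture.
    StageConfig(
        name="projector_alignment",
        dataset="LLaVA-558K",
        trainable="projector only",
        epochs=1,
        batch_size=256,
        learning_rate=1e-3,
    ),
    # Stage 2: Glance Fusion and Gaze Compression are added to LLaVA.
    StageConfig(
        name="glance2gaze_finetune",
        dataset=None,  # not stated in the quoted excerpt
        trainable="all parameters except the visual encoder",
        epochs=1,
        batch_size=128,
        learning_rate=2e-5,
    ),
)


def describe(stage: StageConfig) -> str:
    """Render one stage as a single human-readable line."""
    dataset = stage.dataset if stage.dataset is not None else "unspecified"
    return (
        f"{stage.name}: train {stage.trainable} on {dataset} for "
        f"{stage.epochs} epoch(s), batch size {stage.batch_size}, "
        f"lr {stage.learning_rate:g}"
    )


if __name__ == "__main__":
    print("Glance Fusion layers:", list(FUSION_LAYERS))
    for stage in STAGES:
        print(describe(stage))
```

The Gaze Compression parameters r and pr are omitted here because their values appear only in the paper's Tables 7 and 8, which are not reproduced on this page.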