Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026.

CoFFT: Chain of Foresight-Focus Thought for Visual Language Models

Authors: Xinyu Zhang, Yuxuan Dong, Lingling Zhang, Chengyou Jia, Zhuohang Dang, Basura Fernando, Jun Liu, Mike Zheng Shou

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We conduct extensive experiments across several benchmarks using different VLMs, including Qwen2.5-VL [1], InternVL-2.5 [2], and Llava-Next [24]. CoFFT demonstrates consistent performance gains of 3.1-5.8% on average across these benchmarks. We conduct comprehensive evaluations across multiple complementary benchmarks to assess various aspects of visual reasoning capabilities.
Researcher Affiliation	Academia	1 School of Computer Science and Technology, Xi'an Jiaotong University; 2 Ministry of Education Key Laboratory of Intelligent Networks and Network Security, China; 3 Shaanxi Province Key Laboratory of Big Data Knowledge Engineering, China; 4 IHPC, Agency for Science, Technology and Research, Singapore; 5 Show Lab, National University of Singapore; 6 College of Computing and Data Science, Nanyang Technological University, Singapore
Pseudocode	Yes	To illustrate the workflow in Foresight-Focus Thought, we take the current t + 1 iteration of Foresight-Focus Thought as an example, given original image V, question Q, current visual focus image V_t, and existing reasoning process R_t = {r_1, ..., r_t}, to introduce the following three stages, as shown in Algorithm 1.
Open Source Code	No	Yes, all codes will be open-sourced after the review process.
Open Datasets	Yes	Benchmarks: We conduct comprehensive evaluations across multiple complementary benchmarks to assess various aspects of visual reasoning capabilities. For mathematical and geometric reasoning, we employ MathVista [45] and MathVision [46]. To evaluate cross-domain visual reasoning abilities, we utilize the multi-subject benchmarks M3CoT [47] and MMStar [48]. For chart comprehension assessment, we leverage Charxiv [14]. Additionally, we contribute to the geographical domain by introducing two novel datasets in SeekWorld [16]: SeekWorld-Global, which utilizes Google Maps panoramic imagery, and SeekWorld-China, which incorporates data from the Xiaohongshu App.
Dataset Splits	Yes	Benchmarks: We conduct comprehensive evaluations across multiple complementary benchmarks to assess various aspects of visual reasoning capabilities. For mathematical and geometric reasoning, we employ MathVista [45] and MathVision [46]. To evaluate cross-domain visual reasoning abilities, we utilize the multi-subject benchmarks M3CoT [47] and MMStar [48]. For chart comprehension assessment, we leverage Charxiv [14]. Additionally, we contribute to the geographical domain by introducing two novel datasets in SeekWorld [16]: SeekWorld-Global, which utilizes Google Maps panoramic imagery, and SeekWorld-China, which incorporates data from the Xiaohongshu App. Performance Metrics: We adopt Pass@1 accuracy (Acc.) as our primary performance metric across all benchmarks.
Hardware Specification	Yes	All experiments are run on four NVIDIA A100 GPUs with parallel processing.
Software Dependencies	No	Our experimental approach incorporates state-of-the-art Vision Language Models (VLMs): Qwen2.5-VL-Instruct (7B, 32B) [1], InternVL2.5-Instruct (8B) [2], and Llava-Next (7B) [24], selected for their architectural capabilities and superior performance in visual reasoning.
Experiment Setup	Yes	To ensure sample diversity, the temperature parameter ranges from 0.4 to 1 with an interval of 0.1, and a sample is randomly selected each time it is generated. To prevent repeated selections, the probability weight of each chosen parameter is reduced by half in subsequent sampling processes. The weights are reset to their initial values once all parameters have been selected, ensuring a balanced exploration of different temperature values. For Predictive Decoding and CoFFT, inference is considered complete when the model outputs REASONING_COMPLETE. For our primary experiments, we select l = 5 and k = 4 to balance computational efficiency and performance. (From Section 4.4 Parameter Analysis).
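
The temperature schedule quoted in the Experiment Setup row can be sketched roughly as follows. This is an illustrative reading of that description, not the authors' code: the class name `TemperatureSampler`, the `seed` argument, and the uniform initial weight of 1.0 are assumptions, since the quoted text does not state the initial weights or any implementation details.

```python
import random


class TemperatureSampler:
    """Weighted sampler over a temperature grid (hypothetical helper).

    Each time a temperature is drawn its weight is halved; once every
    temperature has been drawn at least once, all weights are reset.
    """

    def __init__(self, low=0.4, high=1.0, step=0.1, seed=None):
        n = round((high - low) / step) + 1
        # Round to avoid floating-point drift such as 0.7000000000000001.
        self.temperatures = [round(low + i * step, 2) for i in range(n)]
        # Assumption: all temperatures start with equal weight.
        self.weights = [1.0] * n
        self.seen = set()
        self.rng = random.Random(seed)

    def sample(self):
        indices = range(len(self.temperatures))
        idx = self.rng.choices(indices, weights=self.weights, k=1)[0]
        # Halve the weight of the chosen temperature to discourage repeats.
        self.weights[idx] *= 0.5
        self.seen.add(idx)
        # Reset once every temperature has been selected at least once.
        if len(self.seen) == len(self.temperatures):
            self.weights = [1.0] * len(self.temperatures)
            self.seen.clear()
        return self.temperatures[idx]


if __name__ == "__main__":
    sampler = TemperatureSampler(seed=0)
    print([sampler.sample() for _ in range(10)])
```

Each call to `sample()` returns one value from the grid 0.4, 0.5, ..., 1.0. A temperature that has already been drawn in the current cycle becomes progressively less likely until the cycle completes and the weights return to their initial values.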