Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Counterfactual Evolution of Multimodal Datasets via Visual Programming
Authors: Minghe Gao, Zhongqi Yue, Wenjie Yan, Yihao Hu, Wei Ji, Siliang Tang, Jun Xiao, Tat-Seng Chua, Yueting Zhuang, Juncheng Li
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments show that SCOPE improves reasoning performance, exposes model blind spots, and enhances visual dialog capabilities. |
| Researcher Affiliation | Academia | 1 Zhejiang University; 2 National University of Singapore; 3 Nanyang Technological University; 4 Hainan University; 5 Nanjing University |
| Pseudocode | Yes | `def execute_command(image): image_patch = ImagePatch(image); donut_patches = image_patch.find("donut"); donut_count = len(donut_patches); return donut_count` |
| Open Source Code | Yes | To promote transparency and reproducibility, we release the full SCOPE benchmark along with tools for program-based, controllable dataset expansion. |
| Open Datasets | Yes | The benchmark is built by applying the SCOPE framework to perform program-guided expansions over samples from six widely used vision-language datasets: SEED-Bench2 [26], MME [23], MM-Bench [31], GQA [21], OK-VQA [34], and Tally-QA [2]. |
| Dataset Splits | Yes | The final dataset is divided into SCOPE-Train and SCOPE-Test with a 70:30 ratio. |
| Hardware Specification | No | The paper does not explicitly describe the specific hardware (e.g., GPU models, CPU models, memory) used for running its experiments. It mentions models like Qwen2.5-VL-7B and InternVL-2.5-2B, and GPT-4o for program generation, but not the underlying hardware these ran on. |
| Software Dependencies | No | The paper mentions using Python for visual programming and refers to an API library with functions like ImagePatch, find, etc. It also names GPT-4o as a program generator. However, it does not specify version numbers for Python or any specific libraries (e.g., PyTorch, TensorFlow) or frameworks used for the experiments. |
| Experiment Setup | No | The paper describes the models trained (Qwen2.5-VL-7B, InternVL-2.5-2B) and the curriculum learning strategy (MAP). It details evaluation metrics and benchmarks used. However, it does not provide specific experimental setup details such as learning rates, batch sizes, number of epochs, or optimizer settings for training these models on SCOPE-Train. |
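
The Dataset Splits row reports a 70:30 division into SCOPE-Train and SCOPE-Test but does not say how samples were assigned. As a rough illustration only, a seeded random split could look like the sketch below. The function name `split_70_30`, the seed, and the use of uniform shuffling are all assumptions made for this example and are not taken from the paper.

```python
import random


def split_70_30(sample_ids, seed=0):
    """Hypothetical helper: shuffle a copy of the IDs and split them 70:30.

    Uniform shuffling with a fixed seed is an assumption for illustration;
    the paper does not state how its split was produced.
    """
    ids = list(sample_ids)
    random.Random(seed).shuffle(ids)
    # Integer arithmetic avoids floating-point rounding at the boundary.
    cut = len(ids) * 7 // 10
    return ids[:cut], ids[cut:]


train_ids, test_ids = split_70_30(range(100))
```

Fixing the seed makes the assignment repeatable, which is the property a reader would need in order to regenerate the same train and test partitions.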