Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

ChartSketcher: Reasoning with Multimodal Feedback and Reflection for Chart Understanding

Authors: Muye Huang, Lingling Zhang, Jie Ma, Han Lai, Fangzhi Xu, Yifei Li, Wenjun Wu, Yaqiang Wu, Jun Liu

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Experiments demonstrate that ChartSketcher achieves promising performance on chart understanding benchmarks and general vision tasks, providing an interactive and interpretable approach to chart comprehension. We conduct extensive experiments across multiple datasets to demonstrate the effectiveness of ChartSketcher. Through comprehensive ablation studies, we investigate the importance of each training stage and validate the contribution of key components in our work.
Researcher Affiliation	Collaboration	(1) School of Computer Science and Technology, Xi'an Jiaotong University, Xi'an, China; (2) Shaanxi Province Key Laboratory of Big Data Knowledge Engineering, Xi'an, China; (3) MOE KLINNS Lab, Xi'an, China; (4) Zhongguancun Academy, Beijing, China; (5) Lenovo Research
Pseudocode	Yes	Programmatic Sketching Library. To equip MLLMs with image sketching capabilities, we design a simple drawing language library. ... The pseudocode serves as a structured and concise language for defining geometric shapes and applying transformations. ... Algorithm 1 demonstrates the detailed workflow of Sketch-MCTS, along with comprehensive descriptions of its specific parameters.
Open Source Code	Yes	Code and data available at https://github.com/MuyeHuang/ChartSketcher.
Open Datasets	Yes	During the cold start phase, our base dataset images were sourced from EvoChart Corpus [15], with seed questions from ChartQA [30] and EvoChart-QA [15]. To ensure general capability, we incorporated 20% of Visual CoT [41] data into the training mix. For the RL phase, we conducted training across multiple datasets, including ChartQA, ChartBench [56], and Visual CoT. ... We tested chart understanding capabilities on ChartQA and other datasets [36], and evaluated general performance on OpenImages and other datasets [43, 34, 17, 20, 61, 64, 19]. Code and data available at https://github.com/MuyeHuang/ChartSketcher.
Dataset Splits	Yes	We tested chart understanding capabilities on ChartQA and other datasets [36], and evaluated general performance on OpenImages and other datasets [43, 34, 17, 20, 61, 64, 19]. To ensure fair comparison, all experimental results reported in this paper are based on our local reproduction of baseline methods.
Hardware Specification	Yes	All experiments were run on two machines: an Atlas 800T A2 and 8 * A800-40G GPUs.
Software Dependencies	Yes	For model selection, we employed Qwen2.5-32B [40] to construct QA pairs and distill multimodal reasoning and reflection data. We used DeepSeek-Distill-Qwen-14B [7] as the value network for Sketch-MCTS, evaluating the correctness of final answers. We also trained a smaller 2B version, ChartSketcher-2B, to facilitate its use in scenarios with limited computational resources. ChartSketcher-72B and ChartSketcher-2B were initialized with Qwen2VL-72B [50] and Qwen2VL-2B weights, respectively.
Experiment Setup	Yes	During the cold start phase, we trained ChartSketcher for 4 epochs on data without reflection, followed by 1 epoch using RPO [60] loss on reflection data. To reduce computational costs, we employed LoRA [13] training in the cold phase, with a LoRA rank of 16, Alpha of 32, batch size of 64, and learning rate of 1e-4. The RPO ratio was set to 1.0. In the RL phase, we conducted KTO [9] training for 1 epoch, maintaining a LoRA rank of 16 and Alpha of 32, while adjusting the batch size to 32 and reducing the learning rate to 1e-5. For the key parameters of MCTS, the maximum tree depth is 8, the maximum number of child nodes is 3, CPUCT = 3.0, the simulation count limit is 15, and the search exits after successfully finding 3 answers.
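
For readers attempting a reproduction, the values quoted in the Experiment Setup row can be gathered into one place. The following is a minimal Python sketch, not the authors' code: the class names, field names, and helper functions are hypothetical, and the selection rule shown is the common AlphaZero-style PUCT formula, which is an assumption because the quoted text reports only the constant CPUCT = 3.0 and not the exact rule.

```python
from dataclasses import dataclass
import math


@dataclass(frozen=True)
class LoRAStage:
    """Hyperparameters for one LoRA training stage (hypothetical container)."""
    epochs: int
    lora_rank: int
    lora_alpha: int
    batch_size: int
    learning_rate: float


@dataclass(frozen=True)
class SketchMCTSConfig:
    """Search limits quoted for Sketch-MCTS (field names are illustrative)."""
    max_depth: int = 8
    max_children: int = 3
    c_puct: float = 3.0
    max_simulations: int = 15
    answers_to_exit: int = 3


# Cold start: 4 epochs without reflection data (the extra RPO epoch is not modeled here).
COLD_START = LoRAStage(epochs=4, lora_rank=16, lora_alpha=32,
                       batch_size=64, learning_rate=1e-4)
# RL phase: 1 epoch of KTO training with a smaller batch size and learning rate.
RL_PHASE = LoRAStage(epochs=1, lora_rank=16, lora_alpha=32,
                     batch_size=32, learning_rate=1e-5)
MCTS = SketchMCTSConfig()


def puct_score(q_value: float, prior: float, parent_visits: int,
               child_visits: int, c_puct: float = MCTS.c_puct) -> float:
    """Assumed AlphaZero-style PUCT: Q + c * P * sqrt(N_parent) / (1 + N_child)."""
    exploration = c_puct * prior * math.sqrt(parent_visits) / (1 + child_visits)
    return q_value + exploration


def should_stop(simulations_run: int, answers_found: int,
                cfg: SketchMCTSConfig = MCTS) -> bool:
    """Stop when the simulation budget is spent or enough answers were found."""
    return (simulations_run >= cfg.max_simulations
            or answers_found >= cfg.answers_to_exit)
```

With three children and a uniform prior of 1/3, a child visited twice under a parent visited nine times receives an exploration bonus of 1.0 under this assumed formula, so a relatively large constant such as 3.0 keeps the shallow, narrow tree exploring even after a few visits.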