Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.

MINT-CoT: Enabling Interleaved Visual Tokens in Mathematical Chain-of-Thought Reasoning

Authors: Xinyan Chen, Renrui Zhang, Dongzhi JIANG, Aojun Zhou, Shilin Yan, Weifeng Lin, Hongsheng Li

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments demonstrate the effectiveness of our method for effective visual interleaved reasoning in mathematical domains, where MINT-CoT-7B outperforms the baseline model by +34.08% on MathVista, +28.78% on GeoQA, and +23.2% on MMStar, respectively.
Researcher Affiliation | Academia | 1 CUHK MMLab; 2 Shanghai AI Laboratory; 3 CPII under InnoHK
Pseudocode | No | The paper describes methods and processes like the 'data generation pipeline' in Figure 3 and the 'progressive training strategy' in Section 3.3 using structured steps and flowcharts, but it does not contain a formal 'Pseudocode' or 'Algorithm' block.
Open Source Code | Yes | Our code and data are available at https://github.com/xinyan-cxy/MINT-CoT.
Open Datasets | Yes | To empower this capability, we construct the MINT-CoT dataset, containing 54K mathematical problems aligning each reasoning step with visual regions at the token level, accompanied by a rigorous data generation pipeline. ... Our code and data are available at https://github.com/xinyan-cxy/MINT-CoT.
Dataset Splits | No | The paper states that the MINT-CoT dataset is used for training and lists several external benchmarks for evaluation (GeoQA, MathVista, MMStar) and which subsets were used. However, it does not explicitly provide the training/validation/test splits for the self-constructed MINT-CoT dataset itself (e.g., percentages or counts for different splits).
Hardware Specification | No | The paper does not explicitly describe the hardware used, such as specific GPU or CPU models, memory, or cloud computing resources. Appendix A.4, 'Additional Implementation Details', describes training parameters but not the hardware.
Software Dependencies | No | The paper mentions using 'Qwen2-VL-7B [64] as the base MLLM model' but does not provide specific version numbers for this or any other software dependencies.
Experiment Setup | Yes | The training procedure consists of three stages: (1) Text-only CoT Training, where we train for 2 epochs on the MINT-CoT dataset without applying the interleaving strategy, using a learning rate of 5.0e-6 and a batch size of 64, following the configuration of Mulberry [74]; (2) Interleaved CoT SFT, where we train for 3 epochs on the MINT-CoT dataset with a learning rate of 1e-6 and a batch size of 64; and (3) Interleaved CoT RL, where we train for 700 steps on the MINT-CoT dataset, using a group size G = 4, a weighting factor λ = 0.02, a learning rate of 1e-6 and a batch size of 16. ... We uniformly set the threshold θ = 0.7 to filter the similarity scores. The hyper-parameter γ to scale the similarity is set to 1/0.07 following CLIP [54].
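For readers who want the reported experiment setup in machine-readable form, the following is a minimal sketch that records the three training stages and the two similarity hyper-parameters quoted above. The class name StageConfig, the stage identifiers, and the helper select_tokens are hypothetical and do not come from the paper or its repository. The quoted text only says that θ = 0.7 is used "to filter the similarity scores", so the strict greater-than comparison, and applying it directly to the scores passed in, are assumptions.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class StageConfig:
    """One training stage, holding only values quoted in the Experiment Setup row."""
    name: str                       # hypothetical identifier, not from the paper
    learning_rate: float
    batch_size: int
    epochs: Optional[int] = None    # reported for the two supervised stages
    steps: Optional[int] = None     # reported for the RL stage
    group_size: Optional[int] = None          # G, RL stage only
    weighting_factor: Optional[float] = None  # lambda, RL stage only


STAGES = [
    StageConfig("text_only_cot", learning_rate=5.0e-6, batch_size=64, epochs=2),
    StageConfig("interleaved_cot_sft", learning_rate=1e-6, batch_size=64, epochs=3),
    StageConfig("interleaved_cot_rl", learning_rate=1e-6, batch_size=16,
                steps=700, group_size=4, weighting_factor=0.02),
]

THETA = 0.7        # threshold used to filter similarity scores
GAMMA = 1 / 0.07   # similarity scaling factor, following CLIP


def select_tokens(similarities: List[float], theta: float = THETA) -> List[int]:
    """Return the indices whose score exceeds theta.

    Assumption: strict '>' applied to whatever scores are passed in; the
    source does not say how the comparison is made or where scaling occurs.
    """
    return [i for i, score in enumerate(similarities) if score > theta]
```

Under these assumptions, calling select_tokens on a list of per-token scores keeps only the positions scoring above 0.7. GAMMA is recorded but not applied, because the quoted text does not specify where the scaling happens.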