Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

SIGMA: Refining Large Language Model Reasoning via Sibling-Guided Monte Carlo Augmentation

Authors: Yanwei Ren, Haotian Zhang, Fuxiang Wu, Jiayan Qiu, Jiaxing Huang, Baosheng Yu, Liu Liu

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	On the challenging MATH benchmark, our SIGMA-tuned 7B model achieves 54.92% accuracy using only 30K samples, outperforming state-of-the-art models trained on 590K samples. This result highlights that our sibling-guided optimization not only significantly reduces data usage but also significantly boosts LLM reasoning. ... Experiments show that this refinement significantly improves downstream performance...
Researcher Affiliation	Academia	(1) School of Artificial Intelligence, Beihang University; (2) Hangzhou International Innovation Institute, Beihang University; (3) Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences; (4) University of Leicester; (5) Nanyang Technological University. *Corresponding author: EMAIL.
Pseudocode	No	The paper describes the SIGMA framework and its components using textual descriptions and diagrams (Figures 2, 3, 4) but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code	Yes	Code is available at https://github.com/frank130845/SIGMA.
Open Datasets	Yes	We adopt the Qwen2.5-Math-7B as the generation model to construct search trees based on prompts from MATH [20] and GSM8K [13]. ... The ID evaluation includes GSM8K [13] and MATH [20]. The OOD evaluation include four benchmarks: College Math [44], which contains 2,818 university-level problems spanning seven mathematical domains; DeepMind Mathematics [40]... OlympiadBench-Math [19]... and TheoremQA [9]...
Dataset Splits	Yes	To enhance diversity, we generate two MCTS datasets using decoding temperatures of 0.4 and 0.7, each contributing 15K examples to form a combined 30K training set. ... We follow the official DART-Math evaluation protocol [46] and reuse their publicly released test scripts to evaluate all models under a zero-shot greedy decoding setting. ... To evaluate both effectiveness and generalizability, we conduct experiments using the SIGMA-refined 15K and 30K training subsets.
Hardware Specification	Yes	We full-finetuned and evaluated those base models on 4 H100 GPUs. ... We use Qwen2.5-Math-7B for MCTS generation... requiring approximately 42 GPU hours on an RTX 4090... All fine-tuning experiments were carried out on four NVIDIA H100 GPUs with DeepSpeed2 ZeRO [39] optimizations...
Software Dependencies	No	The paper mentions using the AdamW [25] optimizer and DeepSpeed2 ZeRO [39] optimizations, but does not provide specific version numbers for these or for other software such as Python or PyTorch.
Experiment Setup	Yes	The computation sequence token length was fixed at 4096 to capture long range mathematical reasoning. We set per_device_train_batch_size=8 and used gradient_accumulation_steps=4 to accumulate gradients. All models were trained for 3 epochs. ... Initial learning rates were tuned per model: DeepSeekMath-7B: 5e-5, Mistral-7B: 4e-6, LLaMA3-8B: 1e-5.
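
The quoted hyperparameters imply an effective batch size that the excerpt does not state directly. The sketch below works it out, under assumptions that are not confirmed by the quoted text: that all four H100 GPUs act as data-parallel workers, that the 30K training set is used without sequence packing, and that the last partial batch of each epoch is kept. The class name FineTuneSetup and its field names num_gpus and train_examples are hypothetical and chosen only for illustration.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class FineTuneSetup:
    """Hyperparameters quoted in the Experiment Setup row (hypothetical container)."""
    per_device_train_batch_size: int = 8   # quoted
    gradient_accumulation_steps: int = 4   # quoted
    num_train_epochs: int = 3              # quoted
    max_seq_length: int = 4096             # quoted
    num_gpus: int = 4                      # assumption: all four H100s are data-parallel
    train_examples: int = 30_000           # assumption: the combined 30K training set

    @property
    def effective_batch_size(self) -> int:
        # Sequences contributing to one optimizer update across all devices.
        return (self.per_device_train_batch_size
                * self.gradient_accumulation_steps
                * self.num_gpus)

    @property
    def optimizer_steps(self) -> int:
        # Assumption: no packing, and the final partial batch still counts as a step.
        steps_per_epoch = math.ceil(self.train_examples / self.effective_batch_size)
        return steps_per_epoch * self.num_train_epochs


# Per-model initial learning rates, as quoted in the row above.
LEARNING_RATES = {
    "DeepSeekMath-7B": 5e-5,
    "Mistral-7B": 4e-6,
    "LLaMA3-8B": 1e-5,
}

if __name__ == "__main__":
    setup = FineTuneSetup()
    print("effective batch size:", setup.effective_batch_size)
    print("optimizer steps over 3 epochs:", setup.optimizer_steps)
```

Under these assumptions the effective batch size is 8 x 4 x 4 = 128 sequences per update. If the actual runs used a different number of data-parallel workers or packed sequences, the step count would change accordingly.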