Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Causal Sufficiency and Necessity Improves Chain-of-Thought Reasoning

Authors: Xiangning Yu, Zhuohan Wang, Linyi Yang, Haoxuan Li, Anjie Liu, Xiao Xue, Jun Wang, Mengyue Yang

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Extensive experimental results on various mathematical and commonsense reasoning benchmarks confirm substantial improvements in reasoning efficiency and reduced token usage without sacrificing accuracy. Our work provides a promising direction for improving LLM reasoning performance and cost-effectiveness. The code is available at: https://github.com/yxn9191/causalmath.
Researcher Affiliation	Academia	(1) Tianjin University; (2) City University of Hong Kong; (3) University College London; (4) Peking University; (5) The Hong Kong University of Science and Technology (Guangzhou); (6) University of Bristol
Pseudocode	Yes	Algorithm 1: Sufficient and Necessary Optimization of CoT
Open Source Code	Yes	The code is available at: https://github.com/yxn9191/causalmath.
Open Datasets	Yes	Empirical evaluations on mathematical reasoning benchmarks including GSM-8k [10], MATH-500 [25], and AIME [44], as well as the CommonsenseQA [53] dataset confirm that our approach significantly reduces reasoning redundancy while maintaining or improving prediction accuracy.
Dataset Splits	No	We evaluate on diverse reasoning benchmarks to ensure robustness across domains and difficulty levels. For mathematical reasoning, we use: (1) GSM-8k [10], with grade-school problems; (2) MATH-500 [25], covering intermediate-level topics; and (3) AIME, with advanced competition problems up to 2025 [44, 11]. For commonsense reasoning, we use CommonsenseQA [53], a multiple-choice dataset requiring everyday inference. We investigate enhancing LLM performance with optimized CoT data via in-context learning (ICL) and supervised fine-tuning (SFT). We fine-tune reasoning models on 1,229 PNS-selected CoT traces from MATH [25], MMLU [24], ZebraLogicBench [34], CommonsenseQA [53], and AIME (pre-2024) [44].
Hardware Specification	Yes	All SFT training was conducted on 8 NVIDIA RTX 3090 GPUs using the ZeRO-3 optimizer for efficient memory distribution.
Software Dependencies	No	The training used the flash_attention_2 implementation for efficient attention computation, combined with a cosine learning rate scheduler that decays to a minimum learning rate. Each GPU was assigned a batch size of 1 due to the large context length of 16,384 tokens. The model was trained for 3 epochs, and max_steps was left as -1 to allow epoch-based termination. These settings balance computational feasibility and performance under long-context, reasoning-intensive tasks. The same configuration was applied across all target models, including DeepSeek-R1-Qwen-1.5B, DeepScaleR-1.5B-Preview, and Phi-4-mini-reasoning, unless otherwise specified.
Experiment Setup	Yes	Table 4: General SFT Hyperparameters. Hardware: 8 NVIDIA RTX 3090 GPUs, ZeRO-3 optimizer, bf16 mixed precision. Parameter = Value: attn_implementation = flash_attention_2; bf16 = true; learning_rate = 5.0e-05; lr_scheduler_type = cosine_with_min_lr; per_device_train_batch_size = 1; max_steps = -1; max_length = 16384; num_train_epochs = 3.
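
The hyperparameters quoted in the Experiment Setup row can be collected into a small configuration object to make their implications concrete. The following is a minimal sketch, not the authors' training script: the class name `SftHyperparameters`, the helper functions, and the gradient accumulation value of 1 are assumptions introduced for illustration (the quoted text does not state a gradient accumulation setting). Only the field values, the GPU count of 8, and the 1,229 training traces come from the rows above.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class SftHyperparameters:
    """Hypothetical container; field values are copied from the quoted Table 4."""
    attn_implementation: str = "flash_attention_2"
    bf16: bool = True
    learning_rate: float = 5.0e-05
    lr_scheduler_type: str = "cosine_with_min_lr"
    per_device_train_batch_size: int = 1
    max_steps: int = -1          # -1 means training length is set by epochs
    max_length: int = 16384      # maximum tokens per sequence
    num_train_epochs: int = 3


NUM_GPUS = 8            # from the Hardware Specification row
NUM_TRACES = 1229       # PNS-selected CoT traces, from the Dataset Splits row
GRAD_ACCUM_STEPS = 1    # ASSUMPTION: not stated in the quoted text


def global_batch_size(cfg: SftHyperparameters, num_gpus: int, grad_accum: int) -> int:
    """Sequences consumed per optimizer step across all devices."""
    return cfg.per_device_train_batch_size * num_gpus * grad_accum


def total_optimizer_steps(cfg: SftHyperparameters, num_examples: int,
                          num_gpus: int, grad_accum: int) -> int:
    """Epoch-based step count, assuming the last partial batch is kept."""
    steps_per_epoch = math.ceil(num_examples / global_batch_size(cfg, num_gpus, grad_accum))
    return steps_per_epoch * cfg.num_train_epochs


def max_tokens_per_step(cfg: SftHyperparameters, num_gpus: int, grad_accum: int) -> int:
    """Upper bound on tokens per optimizer step (every sequence at max_length)."""
    return global_batch_size(cfg, num_gpus, grad_accum) * cfg.max_length


cfg = SftHyperparameters()
print("global batch size:", global_batch_size(cfg, NUM_GPUS, GRAD_ACCUM_STEPS))
print("optimizer steps:", total_optimizer_steps(cfg, NUM_TRACES, NUM_GPUS, GRAD_ACCUM_STEPS))
print("max tokens per step:", max_tokens_per_step(cfg, NUM_GPUS, GRAD_ACCUM_STEPS))
```

Under these assumptions, the run is short: a global batch of 8 sequences over 1,229 traces gives a few hundred optimizer steps across 3 epochs. A different gradient accumulation setting would change the step count proportionally.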