Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
Causal Sufficiency and Necessity Improves Chain-of-Thought Reasoning
Authors: Xiangning Yu, Zhuohan Wang, Linyi Yang, Haoxuan Li, Anjie Liu, Xiao Xue, Jun Wang, Mengyue Yang
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experimental results on various mathematical and commonsense reasoning benchmarks confirm substantial improvements in reasoning efficiency and reduced token usage without sacrificing accuracy. Our work provides a promising direction for improving LLM reasoning performance and cost-effectiveness. The code is available at: https://github.com/yxn9191/causalmath. |
| Researcher Affiliation | Academia | (1) Tianjin University; (2) City University of Hong Kong; (3) University College London; (4) Peking University; (5) The Hong Kong University of Science and Technology (Guangzhou); (6) University of Bristol |
| Pseudocode | Yes | Algorithm 1: Sufficient and Necessary Optimization of CoT |
| Open Source Code | Yes | The code is available at: https://github.com/yxn9191/causalmath. |
| Open Datasets | Yes | Empirical evaluations on mathematical reasoning benchmarks including GSM-8k [10], MATH-500 [25], and AIME [44], as well as the CommonsenseQA [53] dataset confirm that our approach significantly reduces reasoning redundancy while maintaining or improving prediction accuracy. |
| Dataset Splits | No | We evaluate on diverse reasoning benchmarks to ensure robustness across domains and difficulty levels. For mathematical reasoning, we use: (1) GSM-8k [10], with grade-school problems; (2) MATH-500 [25], covering intermediate-level topics; and (3) AIME, with advanced competition problems up to 2025 [44, 11]. For commonsense reasoning, we use CommonsenseQA [53], a multiple-choice dataset requiring everyday inference. We investigate enhancing LLM performance with optimized CoT data via in-context learning (ICL) and supervised fine-tuning (SFT). We fine-tune reasoning models on 1,229 PNS-selected CoT traces from MATH [25], MMLU [24], ZebraLogicBench [34], CommonsenseQA [53], and AIME (pre-2024) [44]. |
| Hardware Specification | Yes | All SFT training was conducted on 8 NVIDIA RTX 3090 GPUs using the ZeRO-3 optimizer for efficient memory distribution. |
| Software Dependencies | No | The training used the flash_attention_2 implementation for efficient attention computation, combined with a cosine learning rate scheduler that decays to a minimum learning rate. Each GPU was assigned a batch size of 1 due to the large context length of 16,384 tokens. The model was trained for 3 epochs, and max_steps was left as -1 to allow epoch-based termination. These settings balance computational feasibility and performance under long-context, reasoning-intensive tasks. The same configuration was applied across all target models, including DeepSeek-R1-Qwen-1.5B, DeepScaleR-1.5B-Preview, and Phi-4-mini-reasoning, unless otherwise specified. |
| Experiment Setup | Yes | Table 4: General SFT Hyperparameters. Hardware: 8 NVIDIA RTX 3090 GPUs, ZeRO-3 optimizer, bf16 mixed precision. Parameters (name = value): attn_implementation = flash_attention_2; bf16 = true; learning_rate = 5.0e-05; lr_scheduler_type = cosine_with_min_lr; per_device_train_batch_size = 1; max_steps = -1; max_length = 16384; num_train_epochs = 3 |
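
The hyperparameters quoted in the Experiment Setup row can be written down as a small configuration object, which makes the implied batch size and step count easier to check. The following is a minimal sketch, not the authors' training script. The field values come from the quoted Table 4 and the Hardware Specification row. Everything else is an assumption made for illustration: the class and function names are hypothetical, gradient accumulation is taken to be 1 (not reported), all 1,229 traces are assumed to be used for training, the schedule is assumed to have no warmup, and the minimum learning rate is left as a parameter because it is not reported.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class SFTHyperparameters:
    """Values copied from the quoted Table 4; the class itself is hypothetical."""
    attn_implementation: str = "flash_attention_2"
    bf16: bool = True
    learning_rate: float = 5.0e-05
    lr_scheduler_type: str = "cosine_with_min_lr"
    per_device_train_batch_size: int = 1
    max_steps: int = -1  # -1: stop after num_train_epochs instead of a step cap
    max_length: int = 16384
    num_train_epochs: int = 3


NUM_GPUS = 8            # from the Hardware Specification row
NUM_SFT_TRACES = 1229   # from the Dataset Splits row


def global_batch_size(cfg, num_gpus=NUM_GPUS, grad_accum_steps=1):
    """Sequences per optimizer step. grad_accum_steps=1 is an assumption."""
    return cfg.per_device_train_batch_size * num_gpus * grad_accum_steps


def total_optimizer_steps(cfg, num_examples=NUM_SFT_TRACES, num_gpus=NUM_GPUS):
    """Step count if every trace is trained on and the last partial batch is kept."""
    steps_per_epoch = math.ceil(num_examples / global_batch_size(cfg, num_gpus))
    return steps_per_epoch * cfg.num_train_epochs


def cosine_with_min_lr(step, total_steps, peak_lr, min_lr):
    """Cosine decay from peak_lr to min_lr with no warmup (assumed)."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    cfg = SFTHyperparameters()
    steps = total_optimizer_steps(cfg)
    print("global batch size:", global_batch_size(cfg))
    print("optimizer steps:", steps)
    # min_lr below is a placeholder; the source does not state it.
    print("lr at midpoint:", cosine_with_min_lr(steps // 2, steps, cfg.learning_rate, 0.0))
```

Under these assumptions the global batch size is 8 sequences per step, so three epochs over 1,229 traces come to a few hundred optimizer steps. A different gradient accumulation setting or a held-out split would change both numbers.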