Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Reasoning-as-Logic-Units: Scaling Test-Time Reasoning in Large Language Models Through Logic Unit Alignment
Authors: Cheryl Li, Tianyuan Xu, Yiwen Guo
ICML 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments demonstrate that RaLU significantly outperforms existing baselines in mathematical reasoning (GSM8K, MATH) and algorithmic reasoning (HumanEval+, MBPP+), underscoring its potential to advance LLM reasoning and programming by offering enhanced accuracy and interpretability. We evaluate RaLU on four benchmarks, including two for mathematical reasoning (GSM8K (Cobbe et al., 2021b), MATH (Hendrycks et al., 2021)) and the other two for code reasoning: HumanEval (Chen et al., 2021), MBPP (Austin et al., 2021), and their plus versions (Zhong et al., 2024). The evaluation involves three LLM backbones: DeepSeek-V3, Llama3.3-70B-Instruct, and Qwen2.5-72B-Instruct. Experimental results show that RaLU achieves a significant improvement in final answer accuracies or pass@1 compared with best-performing baselines, with specific improvements of 1.22%, 2.07%, 6.60%, and 2.17% on these four benchmarks, respectively. We further perform an extensive ablation study to demonstrate the contributions of our key designs in RaLU. |
| Researcher Affiliation | Collaboration | Cheryl Li 1, Tianyuan Xu 2, Yiwen Guo 1. 1 Independent Researcher, Beijing, China; 2 Peking University, Beijing, China. Correspondence to: Yiwen Guo <EMAIL>. |
| Pseudocode | No | The paper describes the method and its stages (Logic Unit Extraction, Logic Unit Alignment, Solution Synthesis) in natural language, supported by illustrative code snippets and logical flow descriptions (Figure 3, Figure 6, and Appendix A.1). However, it does not contain a formal, clearly labeled 'Pseudocode' or 'Algorithm' block with structured steps for the entire RaLU framework. |
| Open Source Code | Yes | Our code is available at https://github.com/DeepAccept/RaLU. |
| Open Datasets | Yes | We evaluate RaLU on four benchmarks, including two for mathematical reasoning: GSM8K (Cobbe et al., 2021b), MATH (Hendrycks et al., 2021), and AQUA (Ling et al., 2017), as well as the other two for code reasoning, HumanEval (Chen et al., 2021) and MBPP (Austin et al., 2021), along with their extended versions with more test cases (Liu et al., 2023b). |
| Dataset Splits | Yes | We evaluate RaLU on the whole test set except MATH. Due to resource limitation, we follow (Miao et al., 2024) to use a subset of MATH (named MATH-np) taken from (Ling et al., 2023). We use the metrics of the answer accuracy and pass@1 score for math and code reasoning, respectively. |
| Hardware Specification | No | The paper mentions deploying RaLU on specific LLMs (DeepSeek-V3, Qwen2.5-72B-Instruct, Llama3.3-70B-Instruct) and discusses computational resource conservation but does not specify the underlying hardware (e.g., GPU models, CPU types) used for running the experiments. |
| Software Dependencies | No | The paper mentions using specific LLMs (DeepSeek-V3, Qwen2.5-72B-Instruct, Llama3.3-70B-Instruct) but does not provide details on other ancillary software components, such as programming languages, libraries, or frameworks, along with their specific version numbers. It specifies experimental parameters like temperature and frequency penalty, but these are not software versions. |
| Experiment Setup | Yes | We set the maximum number of self-correction turns as 3 and the maximum number of candidate solutions/branches as 10 for Self-Consistency and ToT. The temperature parameter is set to 0.7, and the frequency penalty is 0.3 in all experiments. |
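The pass@1 figures quoted above for the code benchmarks are conventionally computed with the unbiased pass@k estimator introduced alongside HumanEval (Chen et al., 2021); the paper does not spell out its estimator, so this is a hedged sketch, and the sample counts in the example are illustrative only:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k),
    where n = samples generated, c = samples passing all tests."""
    if n - c < k:
        # Every size-k draw must contain at least one passing sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# For k = 1 this reduces to the fraction of passing samples, c / n.
print(round(pass_at_k(10, 3, 1), 6))  # 0.3
```

With a single sample per problem (n = k = 1), pass@1 is simply the fraction of problems whose generated program passes all test cases, which is how a greedy-decoding evaluation would report it.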