Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Reasoning-as-Logic-Units: Scaling Test-Time Reasoning in Large Language Models Through Logic Unit Alignment
Authors: Cheryl Li, Tianyuan Xu, Yiwen Guo
ICML 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments demonstrate that RaLU significantly outperforms existing baselines in mathematical reasoning (GSM8K, MATH) and algorithmic reasoning (HumanEval+, MBPP+), underscoring its potential to advance LLM reasoning and programming by offering enhanced accuracy and interpretability. We evaluate RaLU on four benchmarks, including two for mathematical reasoning (GSM8K (Cobbe et al., 2021b), MATH (Hendrycks et al., 2021)) and the other two for code reasoning: HumanEval (Chen et al., 2021), MBPP (Austin et al., 2021), and their plus versions (Zhong et al., 2024). The evaluation involves three LLM backbones: DeepSeek-V3, Llama3.3-70B-Instruct, and Qwen2.5-72B-Instruct. Experimental results show that RaLU achieves a significant improvement in final answer accuracies or pass@1 compared with best-performing baselines, with specific improvements of 1.22%, 2.07%, 6.60%, and 2.17% on these four benchmarks, respectively. We further perform an extensive ablation study to demonstrate the contributions of our key designs in RaLU. |
| Researcher Affiliation | Collaboration | Cheryl Li 1, Tianyuan Xu 2, Yiwen Guo 1. 1 Independent Researcher, Beijing, China; 2 Peking University, Beijing, China. Correspondence to: Yiwen Guo <EMAIL>. |
| Pseudocode | No | The paper describes the method and its stages (Logic Unit Extraction, Logic Unit Alignment, Solution Synthesis) in natural language, supported by illustrative code snippets and logical flow descriptions (Figure 3, Figure 6, and Appendix A.1). However, it does not contain a formal, clearly labeled 'Pseudocode' or 'Algorithm' block with structured steps for the entire RaLU framework. |
| Open Source Code | Yes | Our code is available at https://github.com/DeepAccept/RaLU. |
| Open Datasets | Yes | We evaluate RaLU on four benchmarks, including two for mathematical reasoning: GSM8K (Cobbe et al., 2021b), MATH (Hendrycks et al., 2021), and AQUA (Ling et al., 2017), as well as the other two for code reasoning, HumanEval (Chen et al., 2021) and MBPP (Austin et al., 2021), along with their extended versions with more test cases (Liu et al., 2023b). |
| Dataset Splits | Yes | We evaluate RaLU on the whole test set except MATH. Due to resource limitation, we follow (Miao et al., 2024) to use a subset of MATH (named MATH-np) taken from (Ling et al., 2023). We use the metrics of the answer accuracy and pass@1 score for math and code reasoning, respectively. |
| Hardware Specification | No | The paper mentions deploying RaLU on specific LLMs (DeepSeek-V3, Qwen2.5-72B-Instruct, Llama3.3-70B-Instruct) and discusses computational resource conservation but does not specify the underlying hardware (e.g., GPU models, CPU types) used for running the experiments. |
| Software Dependencies | No | The paper mentions using specific LLMs (DeepSeek-V3, Qwen2.5-72B-Instruct, Llama3.3-70B-Instruct) but does not provide details on other ancillary software components, such as programming languages, libraries, or frameworks, along with their specific version numbers. It specifies experimental parameters like temperature and frequency penalty, but these are not software versions. |
| Experiment Setup | Yes | We set the maximum number of self-correction turns as 3 and the maximum number of candidate solutions/branches as 10 for Self-Consistency and ToT. The temperature parameter is set to 0.7, and the frequency penalty is 0.3 in all experiments. |
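The pass@1 figures quoted above for the code benchmarks are conventionally computed with the unbiased pass@k estimator introduced alongside HumanEval (Chen et al., 2021); the paper does not spell out its estimator, so this is a hedged sketch, and the sample counts in the example are illustrative only:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k),
    where n = samples generated, c = samples passing all tests."""
    if n - c < k:
        # Every size-k draw must contain at least one passing sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# For k = 1 this reduces to the fraction of passing samples, c / n.
print(round(pass_at_k(10, 3, 1), 6))  # 0.3
```

With a single sample per problem (n = k = 1), pass@1 is simply the fraction of problems whose generated program passes all test cases, which is how a greedy-decoding evaluation would report it.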