Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

LogicTree: Improving Complex Reasoning of LLMs via Instantiated Multi-step Synthetic Logical Data

Authors: Zehao Wang, Lin F. Yang, Jie Wang, Kehan Wang, Hanzhu Chen, Bin Wang, Jianye Hao, Defu Lian, Bin Li, Enhong Chen

NeurIPS 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Experiments on multiple benchmarks demonstrate that our approach achieves an average improvement of 9.4% in accuracy for LLMs on complex logical reasoning tasks. ... To evaluate the effectiveness of our LogicTree, we design a suite of experiments that not only demonstrate its significant impact on enhancing LLMs' reasoning abilities, but also provide in-depth analytical insights.
Researcher Affiliation Collaboration Zehao Wang¹, Lin Yang², Jie Wang¹, Kehan Wang¹, Hanzhu Chen¹, Bin Wang², Jianye Hao²,³, Defu Lian¹, Bin Li¹, Enhong Chen¹; ¹ MoE Key Laboratory of Brain Science and Education, Psychological and Cognition, University of Science and Technology of China; ² Noah's Ark Lab, Huawei Technologies; ³ Tianjin University
Pseudocode Yes Algorithm 1 Logical Reasoning Tree Generation Process
Open Source Code No Justification: we use open-source datasets, and the code used in this paper will be made publicly available upon acceptance.
Open Datasets Yes Evaluation Benchmarks. To evaluate the effectiveness of our proposed LogicTree, we consider a diverse set of reasoning tasks and datasets that require rigorous logical reasoning: (a) LogicBench[33]: a novel task designed to comprehensively evaluate the model's performance on each inference rule. (b) LogiQA2.0[19]: A collection of challenging real-world logical reasoning problems from civil service entrance exams. (c) Three BIG-Bench Hard (BBH)[38] tasks of varying difficulty levels: logical deduction with three, five, and seven objects. (d) FOLIO[13]: An expert-written, logically complex dataset for first-order logic reasoning. (e) Two AGIEval[55] tasks focused on logical reasoning: LSAT-AR and LSAT-LR. (f) Multi-LogiEval[35]: A comprehensive dataset incorporating varying reasoning depths for logical complexity.
Dataset Splits Yes Synthesized Training Dataset. We generated 5,000 symbolic logic trees with depths from 2 to 15, and instantiated each into 3 semantically diverse scenarios, yielding 15,000 reasoning problems. After applying an automatic filtering process that discarded 8.73% of noisy or invalid samples, the final dataset for LLM training contains 13.8k high-quality, multi-step reasoning instances. ... (d) FOLIO[13]: ... For our experiment, we use all validation data, leveraging its dual-format structure to precisely assess models' ability to interpret and reason with formal logical constructs.
Hardware Specification No We utilize models from the Llama-3.1, Mistral-v0.3, Qwen2.5, and DeepSeek-R1-Distill families, with parameter scales ranging from 1.5B to 70B. We employ two distinct fine-tuning strategies: full fine-tuning is applied to smaller-scale models (under 8B), while the larger 70B model utilizes LoRA fine-tuning. ... DeepSpeed with gradient checkpointing and BF16 precision is used for efficient memory usage.
Software Dependencies No DeepSpeed with gradient checkpointing and BF16 precision is used for efficient memory usage.
Experiment Setup Yes Llama-3.1-8B, Mistral-7B-v0.3 and Qwen2.5-7B are all trained with a learning rate of 1e-6. Qwen2.5-1.5B and Qwen2.5-3B are trained with a learning rate of 3e-6. Llama-3.1-70B, due to its LoRA fine-tuning method, is trained with a higher learning rate of 2e-5. The training utilizes a maximum context length of 4096 tokens, a global batch size of 128, and is conducted for 3 epochs.
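
The hyperparameters quoted in the Experiment Setup row can be collected into a small lookup, which makes it easier to see what a reproduction would need to set and what is still missing. The following is a minimal sketch, not the authors' code: the function names and dictionary keys are hypothetical, the training-set size of 13,800 is an approximation of the "13.8k" figure in the Dataset Splits row, and rounding a final partial batch up to a full optimizer step is an assumption. Settings the rows above do not state (optimizer, scheduler, LoRA rank, DeepSpeed ZeRO stage, and learning rates for models not listed) are deliberately left out.

```python
import math

# Learning rates quoted in the "Experiment Setup" row.
LEARNING_RATES = {
    "Llama-3.1-8B": 1e-6,
    "Mistral-7B-v0.3": 1e-6,
    "Qwen2.5-7B": 1e-6,
    "Qwen2.5-1.5B": 3e-6,
    "Qwen2.5-3B": 3e-6,
    "Llama-3.1-70B": 2e-5,  # the only model reported as LoRA fine-tuned
}

MAX_CONTEXT_TOKENS = 4096
GLOBAL_BATCH_SIZE = 128
NUM_EPOCHS = 3

# Assumption: "13.8k" training instances taken as 13,800.
APPROX_TRAIN_EXAMPLES = 13_800


def fine_tuning_method(model_name: str) -> str:
    """Return 'lora' for the 70B model and 'full' otherwise, as the rows describe."""
    return "lora" if model_name == "Llama-3.1-70B" else "full"


def optimizer_steps(num_examples: int, batch_size: int, epochs: int) -> int:
    """Estimate total optimizer steps, counting a final partial batch as a step."""
    return math.ceil(num_examples / batch_size) * epochs


def build_run_config(model_name: str) -> dict:
    """Assemble the reported settings for one model (hypothetical key names)."""
    if model_name not in LEARNING_RATES:
        raise KeyError(f"No learning rate is reported for {model_name!r}")
    return {
        "model": model_name,
        "method": fine_tuning_method(model_name),
        "learning_rate": LEARNING_RATES[model_name],
        "max_context_tokens": MAX_CONTEXT_TOKENS,
        "global_batch_size": GLOBAL_BATCH_SIZE,
        "epochs": NUM_EPOCHS,
        "bf16": True,                    # reported: BF16 precision
        "gradient_checkpointing": True,  # reported: gradient checkpointing
    }


if __name__ == "__main__":
    for name in LEARNING_RATES:
        print(build_run_config(name))
    print(optimizer_steps(APPROX_TRAIN_EXAMPLES, GLOBAL_BATCH_SIZE, NUM_EPOCHS))
```

Under these assumptions a run amounts to roughly 108 optimizer steps per epoch and about 324 in total, which is a useful sanity check when comparing a reproduction's training logs against the reported setup.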