Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Large Language Models Meet Symbolic Provers for Logical Reasoning Evaluation
Authors: Chengwen Qi, Ren Ma, Bowen Li, He Du, Binyuan Hui, Jinwang Wu, Yuanjun Laili, Conghui He
ICLR 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our evaluation shows that state-of-the-art LLMs struggle to solve ProverQA problems, even with CoT prompting, highlighting the dataset's challenging nature. We also finetune Llama3.1-8B-Instruct on a separate training set generated by our framework. The finetuned model demonstrates consistent improvements on both in-distribution and out-of-distribution test sets, suggesting the value of our proposed data generation framework. |
| Researcher Affiliation | Academia | ¹Beihang University, ²Shanghai Artificial Intelligence Laboratory, ³Fudan University, ⁴Zhongguancun Laboratory, ⁵State Key Laboratory of Intelligent Manufacturing Systems Technology, Beijing |
| Pseudocode | No | Our proposed generation framework, which integrates the generative capabilities of LLMs with the precision of a symbolic prover, generates FOL reasoning problems through a structured three-stage process, as illustrated in Figure 1. |
| Open Source Code | Yes | Code available at: https://github.com/opendatalab/ProverGen |
| Open Datasets | Yes | To address these limitations, we propose a novel framework called ProverGen that synergizes the generative strengths of Large Language Models (LLMs) with the rigor and precision of symbolic provers, enabling the creation of a scalable, diverse, and high-quality FOL reasoning dataset, ProverQA. ProverQA is also distinguished by its inclusion of accessible and logically coherent intermediate reasoning steps for each problem. We utilize Llama3.1-70B-Instruct throughout our proposed generation framework to develop a new FOL reasoning benchmark, ProverQA, comprising 1,500 instances. For our submission, we have uploaded the entirety of the source code as a zipped file that has been properly anonymized. The source code contains inline documentation that details the purpose and usage of different parts of the codebase. In addition, we also include the full set of model responses and the evaluation script. As discussed in the ethics statement, we plan to more formally release ProverQA to the public as an open-source repository with thorough details that describe the framework, outline the code, and detail its usage. |
| Dataset Splits | Yes | The benchmark is evenly divided into three subsets based on the length of the reasoning chains: easy (1-2 steps), medium (3-5 steps), and hard (6-9 steps). To create the ProverQA train set, we started by generating 5,000 instances for each difficulty level: easy, medium, and hard. We then applied data augmentation techniques to expand the dataset. This process expanded our generated data pool to 50,000 instances... we sample 5,000 instances from the generated data pool as the final training set for finetuning experiments. |
| Hardware Specification | No | No specific hardware details (GPU/CPU models, memory, cloud instances) are provided in the paper. |
| Software Dependencies | No | The paper mentions specific LLM models used (e.g., Llama3.1-70B-Instruct) and the symbolic prover Prover9, but does not provide specific version numbers for general software dependencies or programming languages (e.g., Python 3.x, PyTorch 1.x). |
| Experiment Setup | Yes | We utilize two prompting strategies: Standard and CoT. Standard prompting uses 2-shot in-context learning to prompt LLMs to answer questions directly, whereas CoT prompting employs 2-shot examples to instruct the model to solve questions step by step. Detailed training configurations are provided in Appendix F. For all finetuning experiments, we experimented with different training hyperparameters, such as training epochs, learning rate, and training set size, and selected the best configurations on held-out validation sets. |
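As a rough illustration of the two prompting strategies quoted in the Experiment Setup row, the sketch below assembles a 2-shot prompt in either Standard or CoT style. The function name, example-dictionary fields, and instruction wording are illustrative assumptions, not taken from the paper or its released code.

```python
# Minimal sketch, assuming a list of example dicts with "question",
# "answer", and "cot_answer" fields. Standard prompting asks for the
# answer directly; CoT prompting asks the model to reason step by step.
def build_prompt(examples, question, cot=False):
    """Assemble a 2-shot prompt in Standard or CoT style (illustrative)."""
    instruction = (
        "Answer the question step by step, then state the final answer."
        if cot
        else "Answer the question directly with True, False, or Uncertain."
    )
    shots = []
    for ex in examples[:2]:  # 2-shot in-context learning
        answer = ex["cot_answer"] if cot else ex["answer"]
        shots.append(f"Question: {ex['question']}\nAnswer: {answer}")
    return "\n\n".join([instruction, *shots, f"Question: {question}\nAnswer:"])
```

The only difference between the two prompt variants is the instruction line and whether the in-context answers include intermediate reasoning, which matches the paper's description of the Standard/CoT contrast.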