Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Large Language Models Meet Symbolic Provers for Logical Reasoning Evaluation
Authors: Chengwen Qi, Ren Ma, Bowen Li, He Du, Binyuan Hui, Jinwang Wu, Yuanjun Laili, Conghui He
ICLR 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our evaluation shows that state-of-the-art LLMs struggle to solve ProverQA problems, even with CoT prompting, highlighting the dataset's challenging nature. We also finetune Llama3.1-8B-Instruct on a separate training set generated by our framework. The finetuned model demonstrates consistent improvements on both in-distribution and out-of-distribution test sets, suggesting the value of our proposed data generation framework. |
| Researcher Affiliation | Academia | ¹Beihang University, ²Shanghai Artificial Intelligence Laboratory, ³Fudan University, ⁴Zhongguancun Laboratory, ⁵State Key Laboratory of Intelligent Manufacturing Systems Technology, Beijing |
| Pseudocode | No | Our proposed generation framework, which integrates the generative capabilities of LLMs with the precision of a symbolic prover, generates FOL reasoning problems through a structured three-stage process, as illustrated in Figure 1. |
| Open Source Code | Yes | Code available at: https://github.com/opendatalab/ProverGen |
| Open Datasets | Yes | To address these limitations, we propose a novel framework called ProverGen that synergizes the generative strengths of Large Language Models (LLMs) with the rigor and precision of symbolic provers, enabling the creation of a scalable, diverse, and high-quality FOL reasoning dataset, ProverQA. ProverQA is also distinguished by its inclusion of accessible and logically coherent intermediate reasoning steps for each problem. We utilize Llama3.1-70B-Instruct throughout our proposed generation framework to develop a new FOL reasoning benchmark, ProverQA, comprising 1,500 instances. For our submission, we have uploaded the entirety of the source code as a zipped file that has been properly anonymized. The source code contains inline documentation that details the purpose and usage of different parts of the codebase. In addition, we also include the full set of model responses and the evaluation script. As discussed in the ethics statement, we plan to more formally release ProverQA to the public as an open-source repository with thorough details that describe the framework, outline the code, and detail its usage. |
| Dataset Splits | Yes | The benchmark is evenly divided into three subsets based on the length of the reasoning chains: easy (1-2 steps), medium (3-5 steps), and hard (6-9 steps). To create the ProverQA train set, we started by generating 5,000 instances for each difficulty level: easy, medium, and hard. We then applied data augmentation techniques to expand the dataset. This process expanded our generated data pool to 50,000 instances... we sample 5,000 instances from the generated data pool as the final training set for finetuning experiments. |
| Hardware Specification | No | No specific hardware details (GPU/CPU models, memory, cloud instances) are provided in the paper. |
| Software Dependencies | No | The paper mentions specific LLM models used (e.g., Llama3.1-70B-Instruct) and the symbolic prover Prover9, but does not provide specific version numbers for general software dependencies or programming languages (e.g., Python 3.x, PyTorch 1.x). |
| Experiment Setup | Yes | We utilize two prompting strategies: Standard and CoT. Standard prompting uses 2-shot in-context learning to prompt LLMs to answer questions directly, whereas CoT prompting employs 2-shot examples to instruct the model to solve questions step by step. Detailed training configurations are provided in Appendix F. For all finetuning experiments, we experimented with different training hyperparameters, such as training epochs, learning rate, and training set size, and selected the best configurations on held-out validation sets. |
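As a rough illustration of the two prompting strategies quoted in the Experiment Setup row, the sketch below assembles a 2-shot prompt in either Standard or CoT style. The function name, example-dictionary fields, and instruction wording are illustrative assumptions, not taken from the paper or its released code.

```python
# Minimal sketch, assuming a list of example dicts with "question",
# "answer", and "cot_answer" fields. Standard prompting asks for the
# answer directly; CoT prompting asks the model to reason step by step.
def build_prompt(examples, question, cot=False):
    """Assemble a 2-shot prompt in Standard or CoT style (illustrative)."""
    instruction = (
        "Answer the question step by step, then state the final answer."
        if cot
        else "Answer the question directly with True, False, or Uncertain."
    )
    shots = []
    for ex in examples[:2]:  # 2-shot in-context learning
        answer = ex["cot_answer"] if cot else ex["answer"]
        shots.append(f"Question: {ex['question']}\nAnswer: {answer}")
    return "\n\n".join([instruction, *shots, f"Question: {question}\nAnswer:"])
```

The only difference between the two prompt variants is the instruction line and whether the in-context answers include intermediate reasoning, which matches the paper's description of the Standard/CoT contrast.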