Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et alK. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026..
Are Language Models Efficient Reasoners? A Perspective from Logic Programming
Authors: Andreas Opedal, Yanick Zengaffinen, Haruki Shirakami, Clemente Pasti, Mrinmaya Sachan, Abulhair Saparov, Ryan Cotterell, Bernhard Schölkopf
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirically, we construct a dataset of math word problems injected with various number of irrelevant axioms that vary in semantic overlap with the goal theorem. We find that current LMs show marked accuracy declines under such conditions even with minimal, domainconsistent distractions and the proofs they generate frequently exhibit detours through irrelevant inferences.2 |
| Researcher Affiliation | Academia | αETH Zürich βMPI for Intelligent Systems, Tübingen γEPFL δIdiap Research Institute ϵPurdue University |
| Pseudocode | Yes | Pseudocode for the forwardchaining is given in Alg. 1 in App. A.2. As a meta-strategy, each instance of forward chaining defines an ordering over proof steps. Different orderings give rise to familiar search algorithms, such as depthfirst search (DFS; Tarjan, 1972), breadth-first search (BFS; Moore, 1959), Dijkstra s (1959) algorithm, and heuristic-based search (Pearl, 1984) like A* (Hart et al., 1968). While DFS and BFS ignore information about the goal theorem, such information can guide search more efficiently, as exemplified by goal-aware strategies like Earley s (1970) algorithm for context-free parsing or its more general equivalent in logic programming magic templates (Bancilhon et al., 1986; Ramakrishnan, 1991; Eisner & Blatz, 2007). |
| Open Source Code | Yes | 2Code for this project is available at https://github.com/rycolab/reasoning-efficiency. |
| Open Datasets | Yes | Empirically, we construct verbalized logic programs for grade school math word problems (GSM problems; Cobbe et al., 2021; Li et al., 2024; Zhang et al., 2024), adopting methods from Opedal et al. (2025). Our primary experimental manipulative is the injection of irrelevant axioms into these GSM programs, which yields many possible implications that are irrelevant to a goal theorem of interest. |
| Dataset Splits | No | We generate 500 problems with varying structural, agent and entity overlap, as discussed above. See App. D.1 for more details on the make-up of the dataset. We refer to the problems without any irrelevant axioms as base problems. We additionally generate control problems that have the same number of axioms as the problems with irrelevant axioms, except that all axioms are relevant. |
| Hardware Specification | No | The compute required to reproduce our experiments is not outside of the ordinary typically accessed by academic labs. Inference on all models we consider is also available through APIs. |
| Software Dependencies | No | The paper refers to specific language models by name and version (e.g., Llama-3.1-8B-Instruct, Qwen2.5-Math-7B-Instruct) but does not list programming languages, libraries, or frameworks with their version numbers that would be needed to run the experiments, beyond the models themselves. |
| Experiment Setup | Yes | We use one vanilla LM and three reasoning LMs in our experiments: Llama-3.1-8B-Instruct (Llama Team, 2024), Qwen2.5-Math-7B-Instruct (Yang et al., 2024), Qwen s reasoning model Qw Q-32B, and Deep Seek-R1 (Deep Seek-AI, 2025). We generate strings using ancestral sampling, restricting the context length to 4000 tokens. Our experimental design is based on in-context learning (Brown et al., 2020). Specifically, we use five fixed in-context examples of shortest proofs to expose the LM to proofs in our verbalized logic programs. |