Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et alK. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026..

Are Language Models Efficient Reasoners? A Perspective from Logic Programming

Authors: Andreas Opedal, Yanick Zengaffinen, Haruki Shirakami, Clemente Pasti, Mrinmaya Sachan, Abulhair Saparov, Ryan Cotterell, Bernhard Schölkopf

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental Empirically, we construct a dataset of math word problems injected with various number of irrelevant axioms that vary in semantic overlap with the goal theorem. We find that current LMs show marked accuracy declines under such conditions even with minimal, domainconsistent distractions and the proofs they generate frequently exhibit detours through irrelevant inferences.2
Researcher Affiliation Academia αETH Zürich βMPI for Intelligent Systems, Tübingen γEPFL δIdiap Research Institute ϵPurdue University
Pseudocode Yes Pseudocode for the forwardchaining is given in Alg. 1 in App. A.2. As a meta-strategy, each instance of forward chaining defines an ordering over proof steps. Different orderings give rise to familiar search algorithms, such as depthfirst search (DFS; Tarjan, 1972), breadth-first search (BFS; Moore, 1959), Dijkstra s (1959) algorithm, and heuristic-based search (Pearl, 1984) like A* (Hart et al., 1968). While DFS and BFS ignore information about the goal theorem, such information can guide search more efficiently, as exemplified by goal-aware strategies like Earley s (1970) algorithm for context-free parsing or its more general equivalent in logic programming magic templates (Bancilhon et al., 1986; Ramakrishnan, 1991; Eisner & Blatz, 2007).
Open Source Code Yes 2Code for this project is available at https://github.com/rycolab/reasoning-efficiency.
Open Datasets Yes Empirically, we construct verbalized logic programs for grade school math word problems (GSM problems; Cobbe et al., 2021; Li et al., 2024; Zhang et al., 2024), adopting methods from Opedal et al. (2025). Our primary experimental manipulative is the injection of irrelevant axioms into these GSM programs, which yields many possible implications that are irrelevant to a goal theorem of interest.
Dataset Splits No We generate 500 problems with varying structural, agent and entity overlap, as discussed above. See App. D.1 for more details on the make-up of the dataset. We refer to the problems without any irrelevant axioms as base problems. We additionally generate control problems that have the same number of axioms as the problems with irrelevant axioms, except that all axioms are relevant.
Hardware Specification No The compute required to reproduce our experiments is not outside of the ordinary typically accessed by academic labs. Inference on all models we consider is also available through APIs.
Software Dependencies No The paper refers to specific language models by name and version (e.g., Llama-3.1-8B-Instruct, Qwen2.5-Math-7B-Instruct) but does not list programming languages, libraries, or frameworks with their version numbers that would be needed to run the experiments, beyond the models themselves.
Experiment Setup Yes We use one vanilla LM and three reasoning LMs in our experiments: Llama-3.1-8B-Instruct (Llama Team, 2024), Qwen2.5-Math-7B-Instruct (Yang et al., 2024), Qwen s reasoning model Qw Q-32B, and Deep Seek-R1 (Deep Seek-AI, 2025). We generate strings using ancestral sampling, restricting the context length to 4000 tokens. Our experimental design is based on in-context learning (Brown et al., 2020). Specifically, we use five fixed in-context examples of shortest proofs to expose the LM to proofs in our verbalized logic programs.