Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.

A Implies B: Circuit Analysis in LLMs for Propositional Logical Reasoning

Authors: Guan Zhe Hong, Nishanth Dikkala, Enming Luo, Cyrus Rashtchian, Xin Wang, Rina Panigrahy

NeurIPS 2025

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | By studying this problem on Mistral and Gemma models, up to 27B parameters, we illuminate the core components the models use to solve such logic problems. From a mechanistic interpretability point of view, we use causal mediation analysis to uncover the pathways and components of the LLMs' reasoning processes. Mistral and Gemma models only achieve 70% to 86% accuracy in writing the correct proof and determining the query value, even with few-shot prompting. |
| Researcher Affiliation | Collaboration | 1 Purdue University, 2 Google Research; EMAIL, EMAIL |
| Pseudocode | No | The paper describes steps and strategies in paragraph text and diagrams (e.g., Figure 1, Figure 30), but does not contain explicitly labeled 'Pseudocode' or 'Algorithm' blocks with structured, code-like formatting. |
| Open Source Code | Yes | Our code is available at https://github.com/guanzhehong/prop-logic-transformer-circuit |
| Open Datasets | No | We study a minimal propositional logic problem that requires combining multiple facts to arrive at a solution. The problem involves five boolean variables. Given two propositions as Rules and truth values of three variables as Facts, we wish to infer the unknown truth value of a different variable (which we call the query). To generate the problem, we use the template and then sample variables and truth values. |
| Dataset Splits | Yes | Specifically, we tested the models on 400 samples. Mistral achieved 96% accuracy when QUERY is for the linear chain, and 70% accuracy when QUERY is for the OR chain (so they average above 70% accuracy). In the training set, the linear chain is queried 20% of the time; the LogOp chain is queried 80% of the time during training. We train every model on 2 million samples. |
| Hardware Specification | Yes | Each model is trained on a single V100 GPU; the full set of models take around 2-3 days to finish training. |
| Software Dependencies | No | We use the AdamW optimizer in PyTorch, with 5k iterations of linear warmup, followed by cosine annealing to a learning rate of 0. |
| Experiment Setup | Yes | We query pre-trained LLMs in a few-shot manner on propositional logic problems defined in the introduction. We show 4 or 6 examples of questions and their minimal proofs (see Appendix B.1 for details). Then, we append a new problem that asks for the truth value of one variable... We set the learning rate to 5e-5, and weight decay to 1e-4. We use a batch size of 512, and train the model for 60k iterations. We use the AdamW optimizer in PyTorch, with 5k iterations of linear warmup, followed by cosine annealing to a learning rate of 0. |
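
The Open Datasets entry describes sampling problems from a template with five boolean variables, two Rules, and three Facts. The template itself is not reproduced on this page, so the following is only an illustrative sketch: the rule shapes ("X or Y implies Z" for the logical-operator chain, "X implies Y" for the linear chain), the sentence wording, the single-letter variable names, and the "undetermined" label for a conclusion that cannot be derived are all assumptions, and `sample_problem` is a hypothetical helper rather than a function from the authors' repository.

```python
import random
import string


def sample_problem(rng: random.Random) -> dict:
    """Sample one toy problem: 5 variables, 2 rules, 3 facts, 1 query."""
    # Five distinct variable names (assumption: single uppercase letters).
    a, b, c, d, e = rng.sample(string.ascii_uppercase, 5)
    # Random truth values for the three premise variables.
    va, vb, vd = (rng.choice([True, False]) for _ in range(3))

    # Assumed rule shapes: an OR chain (a, b -> c) and a linear chain (d -> e).
    rules = [f"{a} or {b} implies {c}.", f"{d} implies {e}."]
    facts = [
        f"{name} is {str(value).lower()}."
        for name, value in ((a, va), (b, vb), (d, vd))
    ]

    # Query the conclusion of one of the two chains.
    query = rng.choice([c, e])
    premise_holds = (va or vb) if query == c else vd
    # A false premise does not force the conclusion to be false, so the
    # value is labeled "undetermined" here (an assumption about labeling).
    answer = "true" if premise_holds else "undetermined"

    prompt = (
        "Rules: " + " ".join(rules) + "\n"
        "Facts: " + " ".join(facts) + "\n"
        f"Question: what is the truth value of {query}?"
    )
    return {"rules": rules, "facts": facts, "query": query,
            "answer": answer, "prompt": prompt}


if __name__ == "__main__":
    example = sample_problem(random.Random(0))
    print(example["prompt"])
    print("Answer:", example["answer"])
```

Passing a seeded `random.Random` keeps the sampled problems repeatable, which helps when comparing accuracy across runs.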
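
The Experiment Setup entry reports a peak learning rate of 5e-5, 5k iterations of linear warmup, and cosine annealing to 0 over a 60k-iteration run. A minimal sketch of that schedule in plain Python follows. It assumes the warmup starts from 0 and that the cosine phase spans the remaining 55k iterations; neither detail is stated on this page, and `lr_at` is a hypothetical helper, not the authors' code.

```python
import math

PEAK_LR = 5e-5         # learning rate reported in the table
WARMUP_ITERS = 5_000   # linear warmup length reported in the table
TOTAL_ITERS = 60_000   # total training iterations reported in the table


def lr_at(step: int) -> float:
    """Learning rate at a 0-indexed training step under the assumed schedule."""
    if step < WARMUP_ITERS:
        # Linear ramp from 0 up to PEAK_LR (assumption: warmup starts at 0).
        return PEAK_LR * step / WARMUP_ITERS
    # Cosine decay from PEAK_LR at the end of warmup down to 0 at TOTAL_ITERS.
    progress = (step - WARMUP_ITERS) / (TOTAL_ITERS - WARMUP_ITERS)
    progress = min(progress, 1.0)  # clamp in case step exceeds TOTAL_ITERS
    return 0.5 * PEAK_LR * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    for step in (0, 2_500, 5_000, 32_500, 60_000):
        print(step, lr_at(step))
```

In a PyTorch training loop, one option would be to pass `lambda step: lr_at(step) / PEAK_LR` to `torch.optim.lr_scheduler.LambdaLR`, which scales the optimizer's base learning rate by the returned factor.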