Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. [K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026].

A Implies B: Circuit Analysis in LLMs for Propositional Logical Reasoning

Authors: Guan Zhe Hong, Nishanth Dikkala, Enming Luo, Cyrus Rashtchian, Xin Wang, Rina Panigrahy

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	By studying this problem on Mistral and Gemma models, up to 27B parameters, we illuminate the core components the models use to solve such logic problems. From a mechanistic interpretability point of view, we use causal mediation analysis to uncover the pathways and components of the LLMs' reasoning processes. Mistral and Gemma models only achieve 70% to 86% accuracy in writing the correct proof and determining the query value, even with few-shot prompting.
Researcher Affiliation	Collaboration	1 Purdue University, 2 Google Research. EMAIL, EMAIL
Pseudocode	No	The paper describes steps and strategies in paragraph text and diagrams (e.g., Figure 1, Figure 30), but does not contain explicitly labeled 'Pseudocode' or 'Algorithm' blocks with structured, code-like formatting.
Open Source Code	Yes	Our code is available at https://github.com/guanzhehong/prop-logic-transformer-circuit
Open Datasets	No	We study a minimal propositional logic problem that requires combining multiple facts to arrive at a solution. The problem involves five boolean variables. Given two propositions as Rules and truth values of three variables as Facts, we wish to infer the unknown truth value of a different variable (which we call the query). To generate the problem, we use the template and then sample variables and truth values.
Dataset Splits	Yes	Specifically, we tested the models on 400 samples. Mistral achieved 96% accuracy when QUERY is for the linear chain, and 70% accuracy when QUERY is for the OR chain (so they average above 70% accuracy). In the training set, the linear chain is queried 20% of the time; the LogOp chain is queried 80% of the time during training. We train every model on 2 million samples.
Hardware Specification	Yes	Each model is trained on a single V100 GPU; the full set of models take around 2-3 days to finish training.
Software Dependencies	No	We use the AdamW optimizer in PyTorch, with 5k iterations of linear warmup, followed by cosine annealing to a learning rate of 0.
Experiment Setup	Yes	We query pre-trained LLMs in a few-shot manner on propositional logic problems defined in the introduction. We show 4 or 6 examples of questions and their minimal proofs (see Appendix B.1 for details). Then, we append a new problem that asks for the truth value of one variable... We set the learning rate to 5e-5, and weight decay to 1e-4. We use a batch size of 512, and train the model for 60k iterations. We use the AdamW optimizer in PyTorch, with 5k iterations of linear warmup, followed by cosine annealing to a learning rate of 0.
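
The learning-rate schedule quoted in the Experiment Setup row (5k iterations of linear warmup, then cosine annealing to 0 over a 60k-iteration run, peak learning rate 5e-5) can be sketched as below. This is an illustration under stated assumptions, not the authors' code: the helper name `lr_at` is hypothetical, the warmup is assumed to start from 0, and the cosine phase is assumed to span the 55k iterations that remain after warmup.

```python
import math

PEAK_LR = 5e-5         # learning rate quoted in the table
WARMUP_STEPS = 5_000   # linear warmup iterations quoted in the table
TOTAL_STEPS = 60_000   # total training iterations quoted in the table


def lr_at(step: int) -> float:
    """Learning rate at a 0-indexed training step (hypothetical helper)."""
    if step < WARMUP_STEPS:
        # Assumption: warmup ramps linearly from 0 up to the peak rate.
        return PEAK_LR * step / WARMUP_STEPS
    # Assumption: cosine annealing covers the steps remaining after warmup
    # and reaches 0 exactly at TOTAL_STEPS.
    progress = min((step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS), 1.0)
    return 0.5 * PEAK_LR * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    # Print the rate at a few points along the schedule.
    for s in (0, 2_500, 5_000, 32_500, 60_000):
        print(s, lr_at(s))
```

In a PyTorch training loop, a function of this shape could be divided by the peak rate and passed to `torch.optim.lr_scheduler.LambdaLR` as a multiplier on an AdamW optimizer; whether the paper's code does it this way is not stated in the table.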