Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et alK. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026..

Logic.py: Bridging the Gap between LLMs and Constraint Solvers

Authors: Pascal Kesseli, Peter O'Hearn, Ricardo S. Cabral

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We demonstrate the efﬁcacy of this approach on benchmarks like the logic puzzles tasks in Zebra Logic Bench. Instead of letting the LLM attempt to directly solve the puzzles, our method prompts the model to formalise the problem in a logic-focused, human-readable, domain-speciﬁc language (DSL) called Logic.py. This formalised representation is then solved using a constraint solver, leveraging the strengths of both the language model and the solver. Our approach achieves a remarkable 65% absolute improvement over the baseline performance of Llama 3.1 70B on Zebra Logic Bench, increasing its accuracy to over 90%.
Researcher Affiliation	Industry	Pascal Kesseli Meta EMAIL Peter O Hearn Meta EMAIL Ricardo Silveira Cabral Meta EMAIL
Pseudocode	Yes	Fig. 3 provides an example of the kind of data structure the LLM will generate. class House: house_number: Unique[Domain[int , range(1, 7)]] name: Unique[Domain[str , "Alice", "Eric", ...]] # ... class Puzzle Solution: houses: list[House , 6] Figure 3: Model Output Example: Result Data Structure
Open Source Code	Yes	We provide the full prompts in our open source agent project polymath2. 2https://github.com/facebookresearch/polymath
Open Datasets	Yes	We evaluate the efﬁcacy of this approach on the logic puzzle benchmark Zebra Logic Bench Lin et al. [2024] and the ﬁrst order logic benchmark FOLIO Han et al. [2022].
Dataset Splits	Yes	To evaluate the effectiveness of our approach, we conducted experiments on Zebra Logic Bench, a benchmark suite consisting of 1000 logic grid tasks. In order to run the evaluation independently, researchers must request access to a private dataset hosted huggingface.co.
Hardware Specification	Yes	We run the logic agent implementation on a developer server with 56 cores and 114GB RAM. We self-hosted Llama 3.1 70B on a clust wither 700 GPUs.
Software Dependencies	No	We use lib CST1 to transform Logic.py to C for analysis by CBMC, as explained in more detail in Sec. 5.
Experiment Setup	Yes	As mentioned in Sec. referror-recovery, our experimental setup includes restarting on UNSAT, but not on ambiguity. Removing this restart on UNSAT logic did not lead to any measurable decrease in score across our experimental evaluation runs. In general, we restart at most ﬁve times in case of formalisation syntax errors or UNSAT results before aborting. Independently, we also retry at most ﬁve times if we receive an HTTP error from our LLM inference API, e.g. due to throttling.