Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et alK. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026..
Logic.py: Bridging the Gap between LLMs and Constraint Solvers
Authors: Pascal Kesseli, Peter O'Hearn, Ricardo S. Cabral
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate the efficacy of this approach on benchmarks like the logic puzzles tasks in Zebra Logic Bench. Instead of letting the LLM attempt to directly solve the puzzles, our method prompts the model to formalise the problem in a logic-focused, human-readable, domain-specific language (DSL) called Logic.py. This formalised representation is then solved using a constraint solver, leveraging the strengths of both the language model and the solver. Our approach achieves a remarkable 65% absolute improvement over the baseline performance of Llama 3.1 70B on Zebra Logic Bench, increasing its accuracy to over 90%. |
| Researcher Affiliation | Industry | Pascal Kesseli Meta EMAIL Peter O Hearn Meta EMAIL Ricardo Silveira Cabral Meta EMAIL |
| Pseudocode | Yes | Fig. 3 provides an example of the kind of data structure the LLM will generate. class House: house_number: Unique[Domain[int , range(1, 7)]] name: Unique[Domain[str , "Alice", "Eric", ...]] # ... class Puzzle Solution: houses: list[House , 6] Figure 3: Model Output Example: Result Data Structure |
| Open Source Code | Yes | We provide the full prompts in our open source agent project polymath2. 2https://github.com/facebookresearch/polymath |
| Open Datasets | Yes | We evaluate the efficacy of this approach on the logic puzzle benchmark Zebra Logic Bench Lin et al. [2024] and the first order logic benchmark FOLIO Han et al. [2022]. |
| Dataset Splits | Yes | To evaluate the effectiveness of our approach, we conducted experiments on Zebra Logic Bench, a benchmark suite consisting of 1000 logic grid tasks. In order to run the evaluation independently, researchers must request access to a private dataset hosted huggingface.co. |
| Hardware Specification | Yes | We run the logic agent implementation on a developer server with 56 cores and 114GB RAM. We self-hosted Llama 3.1 70B on a clust wither 700 GPUs. |
| Software Dependencies | No | We use lib CST1 to transform Logic.py to C for analysis by CBMC, as explained in more detail in Sec. 5. |
| Experiment Setup | Yes | As mentioned in Sec. referror-recovery, our experimental setup includes restarting on UNSAT, but not on ambiguity. Removing this restart on UNSAT logic did not lead to any measurable decrease in score across our experimental evaluation runs. In general, we restart at most five times in case of formalisation syntax errors or UNSAT results before aborting. Independently, we also retry at most five times if we receive an HTTP error from our LLM inference API, e.g. due to throttling. |