Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et alK. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026..

Logic.py: Bridging the Gap between LLMs and Constraint Solvers

Authors: Pascal Kesseli, Peter O'Hearn, Ricardo S. Cabral

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental We demonstrate the efficacy of this approach on benchmarks like the logic puzzles tasks in Zebra Logic Bench. Instead of letting the LLM attempt to directly solve the puzzles, our method prompts the model to formalise the problem in a logic-focused, human-readable, domain-specific language (DSL) called Logic.py. This formalised representation is then solved using a constraint solver, leveraging the strengths of both the language model and the solver. Our approach achieves a remarkable 65% absolute improvement over the baseline performance of Llama 3.1 70B on Zebra Logic Bench, increasing its accuracy to over 90%.
Researcher Affiliation Industry Pascal Kesseli Meta EMAIL Peter O Hearn Meta EMAIL Ricardo Silveira Cabral Meta EMAIL
Pseudocode Yes Fig. 3 provides an example of the kind of data structure the LLM will generate. class House: house_number: Unique[Domain[int , range(1, 7)]] name: Unique[Domain[str , "Alice", "Eric", ...]] # ... class Puzzle Solution: houses: list[House , 6] Figure 3: Model Output Example: Result Data Structure
Open Source Code Yes We provide the full prompts in our open source agent project polymath2. 2https://github.com/facebookresearch/polymath
Open Datasets Yes We evaluate the efficacy of this approach on the logic puzzle benchmark Zebra Logic Bench Lin et al. [2024] and the first order logic benchmark FOLIO Han et al. [2022].
Dataset Splits Yes To evaluate the effectiveness of our approach, we conducted experiments on Zebra Logic Bench, a benchmark suite consisting of 1000 logic grid tasks. In order to run the evaluation independently, researchers must request access to a private dataset hosted huggingface.co.
Hardware Specification Yes We run the logic agent implementation on a developer server with 56 cores and 114GB RAM. We self-hosted Llama 3.1 70B on a clust wither 700 GPUs.
Software Dependencies No We use lib CST1 to transform Logic.py to C for analysis by CBMC, as explained in more detail in Sec. 5.
Experiment Setup Yes As mentioned in Sec. referror-recovery, our experimental setup includes restarting on UNSAT, but not on ambiguity. Removing this restart on UNSAT logic did not lead to any measurable decrease in score across our experimental evaluation runs. In general, we restart at most five times in case of formalisation syntax errors or UNSAT results before aborting. Independently, we also retry at most five times if we receive an HTTP error from our LLM inference API, e.g. due to throttling.