Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
CRANE: Reasoning with constrained LLM generation
Authors: Debangshu Banerjee, Tarun Suresh, Shubham Ugare, Sasa Misailovic, Gagandeep Singh
ICML 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on multiple open-source LLMs and benchmarks show that CRANE significantly outperforms both state-of-the-art constrained decoding strategies and standard unconstrained decoding, showing up to 10% points accuracy improvement over baselines on challenging symbolic reasoning benchmarks GSM-symbolic and FOLIO. ... In this section, we evaluate CRANE on a math reasoning task (GSM-Symbolic (Mirzadeh et al., 2024)) and a logical reasoning task (FOLIO (Han et al., 2024)) and demonstrate significant improvement over both unconstrained and SOTA constrained generation baselines. |
| Researcher Affiliation | Academia | Department of Computer Science, University of Illinois Urbana-Champaign, USA. Correspondence to: Debangshu Banerjee <EMAIL>. |
| Pseudocode | Yes | Algorithm 1 CRANE Algorithm |
| Open Source Code | No | The paper does not explicitly state that the code for the methodology is open-source, nor does it provide a direct link to a code repository. It mentions that CRANE is implemented using PyTorch and Hugging Face transformers, which are third-party libraries, but does not point to the specific code for CRANE. |
| Open Datasets | Yes | We evaluate CRANE on a math reasoning task (GSM-Symbolic (Mirzadeh et al., 2024)) and a logical reasoning task (FOLIO (Han et al., 2024)). |
| Dataset Splits | Yes | We further evaluate CRANE on the validation split of FOLIO dataset... We use 2 few-shot examples in the prompt. |
| Hardware Specification | Yes | Experimental Setup. We run experiments on a 48-core Intel Xeon Silver 4214R CPU with 2 NVidia RTX A5000 GPUs. |
| Software Dependencies | No | The paper mentions using PyTorch (Paszke et al., 2019), Hugging Face transformers library (Wolf et al., 2020), Z3 solver (De Moura & Bjørner, 2008), ITERGEN library (Ugare et al., 2024a), and SYNCODE framework (Han et al., 2024). However, it provides citations to papers describing these tools/libraries rather than specific software version numbers required for reproduction (e.g., PyTorch 1.9, Z3 v4.8.10). |
| Experiment Setup | Yes | We run greedy decoding with a maximum new token limit of 600 and prompt the LLMs with the 8-shot examples from GSM-Symbolic... For ITERGEN and CRANE, we enforce syntactic constraints via the context-free grammar provided in Appendix D.5.1 and apply the semantic constraint... For CRANE, we use << and >> for the delimiters S1 and S2, respectively. ... For all approaches and models, we run greedy decoding with a maximum new tokens limit of 800 and use 2 few-shot examples in the prompt. |
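
The Experiment Setup row quotes `<<` and `>>` as the delimiters S1 and S2. As a rough illustration of what such delimiters mark off in a generated string, the sketch below splits text into spans inside and outside a delimiter pair. It is not the CRANE implementation: the function name `split_segments` and the labels `free` and `constrained` are hypothetical, and the assumption that the delimited spans are the grammar-constrained ones is ours, not something stated in the table.

```python
import re

# Delimiters S1 and S2 as quoted in the Experiment Setup row.
S1, S2 = "<<", ">>"


def split_segments(text: str, start: str = S1, end: str = S2):
    """Split text into (kind, content) pairs.

    kind is "constrained" for spans between start and end (delimiters
    stripped) and "free" for everything else. Both labels are illustrative.
    """
    # Non-greedy match so each span ends at the nearest closing delimiter.
    pattern = re.compile(re.escape(start) + r"(.*?)" + re.escape(end), re.DOTALL)
    segments = []
    pos = 0
    for match in pattern.finditer(text):
        if match.start() > pos:
            segments.append(("free", text[pos:match.start()]))
        segments.append(("constrained", match.group(1)))
        pos = match.end()
    if pos < len(text):
        segments.append(("free", text[pos:]))
    return segments


if __name__ == "__main__":
    for kind, content in split_segments("Total is <<3 + 4>> apples"):
        print(kind, repr(content))
```

An unmatched opening delimiter is left inside a `free` span here; a real decoder would need its own policy for that case.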