Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

CRANE: Reasoning with constrained LLM generation

Authors: Debangshu Banerjee, Tarun Suresh, Shubham Ugare, Sasa Misailovic, Gagandeep Singh

ICML 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Experiments on multiple open-source LLMs and benchmarks show that CRANE significantly outperforms both state-of-the-art constrained decoding strategies and standard unconstrained decoding, showing up to 10% points accuracy improvement over baselines on challenging symbolic reasoning benchmarks GSM-symbolic and FOLIO. ... In this section, we evaluate CRANE on a math reasoning task (GSM-Symbolic (Mirzadeh et al., 2024)) and a logical reasoning task (FOLIO (Han et al., 2024)) and demonstrate significant improvement over both unconstrained and SOTA constrained generation baselines.
Researcher Affiliation	Academia	1Department of Computer Science, University of Illinois Urbana-Champaign, USA. Correspondence to: Debangshu Banerjee <EMAIL>.
Pseudocode	Yes	Algorithm 1 CRANE Algorithm
Open Source Code	No	The paper does not explicitly state that the code for the methodology is open-source, nor does it provide a direct link to a code repository. It mentions that CRANE is implemented using PyTorch and Hugging Face transformers, which are third-party libraries, but it does not point to the specific code for CRANE.
Open Datasets	Yes	We evaluate CRANE on a math reasoning task (GSM-Symbolic (Mirzadeh et al., 2024)) and a logical reasoning task (FOLIO (Han et al., 2024)).
Dataset Splits	Yes	We further evaluate CRANE on the validation split of FOLIO dataset... We use 2 few-shot examples in the prompt.
Hardware Specification	Yes	Experimental Setup. We run experiments on a 48-core Intel Xeon Silver 4214R CPU with 2 NVidia RTX A5000 GPUs.
Software Dependencies	No	The paper mentions using PyTorch (Paszke et al., 2019), the Hugging Face transformers library (Wolf et al., 2020), the Z3 solver (De Moura & Bjørner, 2008), the ITERGEN library (Ugare et al., 2024a), and the SYNCODE framework (Han et al., 2024). However, it provides citations to papers describing these tools/libraries rather than the specific software version numbers required for reproduction (e.g., PyTorch 1.9, Z3 v4.8.10).
Experiment Setup	Yes	We run greedy decoding with a maximum new token limit of 600 and prompt the LLMs with the 8-shot examples from GSM-Symbolic... For ITERGEN and CRANE, we enforce syntactic constraints via the context-free grammar provided in Appendix D.5.1 and apply the semantic constraint... For CRANE, we use << and >> for the delimiters S1 and S2, respectively. ... For all approaches and models, we run greedy decoding with a maximum new tokens limit of 800 and use 2 few-shot examples in the prompt.
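
The Experiment Setup row names << and >> as the delimiters S1 and S2. As a minimal illustration only, and not the paper's implementation, the Python sketch below splits a piece of generated text into spans inside and outside those delimiters. The function name split_by_delimiters, the "free"/"delimited" labels, and the example sentence are hypothetical, and the sketch assumes that delimited spans are the ones a grammar constraint would apply to.

```python
import re

# Delimiters S1 and S2 as reported in the Experiment Setup row.
S1, S2 = "<<", ">>"


def split_by_delimiters(text, start=S1, end=S2):
    """Split text into (kind, segment) pairs.

    kind is "delimited" for text between a start/end pair and "free" for
    everything else. An unmatched start delimiter is left inside a "free"
    segment. The labels are hypothetical names chosen for this sketch.
    """
    pattern = re.compile(
        re.escape(start) + r"(.*?)" + re.escape(end), re.DOTALL
    )
    segments = []
    pos = 0
    for match in pattern.finditer(text):
        # Text before the opening delimiter is unconstrained prose.
        if match.start() > pos:
            segments.append(("free", text[pos:match.start()]))
        # Text between the delimiters, with the delimiters stripped.
        segments.append(("delimited", match.group(1)))
        pos = match.end()
    # Trailing text after the last closing delimiter.
    if pos < len(text):
        segments.append(("free", text[pos:]))
    return segments


# Hypothetical model output, not taken from the paper.
example = "She has 3 boxes of 4 pens, so the total is <<3 * 4>> pens."
segments = split_by_delimiters(example)
```

A checker built on this kind of split could pass only the "delimited" segments to a grammar or solver and leave the surrounding reasoning text untouched.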