Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Guiding LLMs The Right Way: Fast, Non-Invasive Constrained Generation
Authors: Luca Beurer-Kellner, Marc Fischer, Martin Vechev
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | 4. Experimental Evaluation: We evaluate DOMINO in terms of downstream task accuracy, compare its performance to multiple baselines and ablate key parameters such as k. |
| Researcher Affiliation | Academia | Luca Beurer-Kellner 1 Marc Fischer 1 Martin Vechev 1 1Department of Computer Science, ETH Zürich, Switzerland. |
| Pseudocode | Yes | Algorithm 1: Constrained Decoding. Input: Checker C, LLM f, Tokenized Prompt x. Output: Completion o adhering to C. |
| Open Source Code | Yes | We release DOMINO as open source on GitHub. |
| Open Datasets | Yes | Datasets: We assess downstream accuracy of different constraining methods with the GSM8K (Cobbe et al., 2021) benchmark for math reasoning and CoNLL-2003 (Sang & De Meulder, 2003) for named-entity recognition (subset of 400 test samples). |
| Dataset Splits | Yes | For GSM8K and CoNLL-2003, we prompt and constrain the models to generate a response in a given JSON format... Our prompts consist of 5 few-shot demonstrations from the training split, for which we manually construct the corresponding JSON response. ... CoNLL-2003 for named-entity recognition (subset of 400 test samples). |
| Hardware Specification | Yes | As inference backends, we rely on both, transformers (Wolf et al., 2019) and llama.cpp (Gerganov & et. al.) on NVIDIA A100 40GB or H100 80GB GPUs. |
| Software Dependencies | No | The paper mentions software like 'transformers' and 'llama.cpp' but does not specify their version numbers. |
| Experiment Setup | Yes | Setup: We run 100 repetitions per configuration. In each, we sample one of 5 different prompts per workload, and sample output of up to 128 tokens from the model, using a temperature value of 1.0. |
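
For readers unfamiliar with the setting, the input/output signature quoted in the Pseudocode row (a checker C, an LLM f, a tokenized prompt x, and a completion o adhering to C) can be illustrated with a generic mask-and-sample loop. This is a minimal sketch of plain constrained decoding, not the paper's Algorithm 1 and not DOMINO itself. The toy vocabulary, `toy_logits`, `digits_only_mask`, and the function name `constrained_decode` are all hypothetical stand-ins; only the 128-token cap and the temperature of 1.0 are taken from the Experiment Setup row.

```python
import math
import random

# Hypothetical toy vocabulary; a real LLM would have tens of thousands of tokens.
VOCAB = ["a", "b", "0", "1", "<eos>"]
EOS_ID = VOCAB.index("<eos>")


def toy_logits(token_ids):
    """Stand-in for the LLM f: fixed logits that prefer the letters 'a' and 'b'."""
    return [2.0, 2.0, 0.0, 0.0, 1.0]


def digits_only_mask(output_ids):
    """Stand-in for the checker C: allow digits, and <eos> once a digit exists."""
    has_digit = len(output_ids) > 0
    return [tok.isdigit() or (tok == "<eos>" and has_digit) for tok in VOCAB]


def constrained_decode(checker_mask, llm_logits, prompt_ids, eos_id,
                       max_tokens=128, temperature=1.0, rng=None):
    """Sample a completion in which every token is permitted by the checker."""
    rng = rng or random.Random(0)
    output = []
    for _ in range(max_tokens):
        logits = llm_logits(list(prompt_ids) + output)
        mask = checker_mask(output)
        allowed = [i for i, ok in enumerate(mask) if ok]
        if not allowed:
            break  # the checker permits no continuation
        # Softmax restricted to the allowed tokens (shifted for stability).
        scaled = [logits[i] / temperature for i in allowed]
        top = max(scaled)
        weights = [math.exp(s - top) for s in scaled]
        token = rng.choices(allowed, weights=weights, k=1)[0]
        if token == eos_id:
            break
        output.append(token)
    return output


completion = constrained_decode(digits_only_mask, toy_logits, [0, 1], EOS_ID)
print("".join(VOCAB[t] for t in completion))
```

Although the toy model prefers letters, the mask removes them before sampling, so the completion contains only digits. Calling the checker and rebuilding the mask at every step is the simple baseline; it says nothing about how the paper reduces that overhead.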