Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026.
Constrained Sampling for Language Models Should Be Easy: An MCMC Perspective
Authors: Emmanuel Anaya Gonzalez, Sairam Vaidya, Kanghee Park, Ruyi Ji, Taylor Berg-Kirkpatrick, Loris D'Antoni
NeurIPS 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirically, our sampler outperforms existing methods on both synthetic benchmarks and real-world program fuzzing tasks. We first evaluate the different sampling methods on four grammar-constrained generation tasks proposed by Park et al. [43] and Lipkin et al. [37]. In Sec. 4.2 we show that MCMC, when compared to alternative sampling techniques, improves the quality of seeds needed to bootstrap fuzzing algorithms. |
| Researcher Affiliation | Academia | Emmanuel Anaya Gonzalez¹, Sairam Vaidya¹, Kanghee Park¹, Ruyi Ji², Taylor Berg-Kirkpatrick¹, Loris D'Antoni¹; ¹UCSD, ²Peking University |
| Pseudocode | Yes | Algorithm 1: The Metropolis-Hastings algorithm instantiated for grammar-aligned sampling. Data: the LM P, the grammar G, a parameter k denoting the chain length, and a configurable distribution pw POS for sampling a random prefix from a given sequence. Result: a proposed sequence s . |
| Open Source Code | Yes | Code available at https://github.com/large-loris-models/casa |
| Open Datasets | Yes | We first evaluate the different sampling methods on four grammar-constrained generation tasks proposed by Park et al. [43] and Lipkin et al. [37]. From [43], two of our tasks involve synthesizing expressions in an extension of linear integer arithmetic (SLIA) and loop invariants with bit-vector arithmetic (BV4). Our third task is the constituency parsing (CP) task already used in prior GCD work [19] where the grammar is used to help the model produce well-parenthesized parse trees for English sentences. For our fourth domain, we draw the Molecular Synthesis (MS) task from [37] |
| Dataset Splits | No | For each individual task, MCMC variant, and number of steps, we obtain 100 samples and use bootstrapping [16] to report mean KL divergence and 95% confidence interval. Second, to evaluate the trade-off between seed quality and quantity, we vary the number of initial seeds (N ∈ {50, 100, 200, 500}) for compute-matched comparisons. |
| Hardware Specification | Yes | Our experiments were conducted on Ubuntu 22.04 LTS nodes with Intel Xeon Gold 6230 CPUs (2.10 GHz, 10 cores, 20 threads allocated) and 384 GB RAM. For GPU-accelerated workloads, we provisioned 2x NVIDIA RTX A6000 GPUs. SMC+AWRS (k = 10) was evaluated on 1 H100 GPU (120GB) due to higher memory requirements. |
| Software Dependencies | Yes | Our implementation is based on Python 3.10.12, PyTorch 2.6.0+cu124, AFL++ 4.00c and LLVM 14.0.0. |
| Experiment Setup | Yes | For language-model decoding, we set temperature to 1.0, top-p to 1.0, and top-k to 0 to allow sampling from the full token vocabulary. We limited the maximum number of newly generated tokens to 512 for XML, and 1024 for SQL .test scripts. Each (benchmark, method) pair was evaluated in N = 5 independent, single-instance AFL++ runs of exactly 21600 s (six hours). We set AFL_RANDOM_SEED to 42 + i, (i = 1...5) for reproducibility and configure standard environment variables to ensure noninteractive execution. All other AFL++ parameters remained at defaults to isolate the impact of seed corpus quality. |
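
The Dataset Splits row quotes the paper as using bootstrapping over 100 samples to report a mean KL divergence with a 95% confidence interval. The excerpt does not say which bootstrap variant was used, so the following is only a minimal sketch of one common choice, the percentile bootstrap. The function name `bootstrap_mean_ci`, the resample count, and the synthetic `kl_estimates` values are all assumptions made for illustration; none of them come from the paper or its repository.

```python
import random
import statistics


def bootstrap_mean_ci(values, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap for the sample mean.

    Returns (mean, lower, upper), where [lower, upper] is a
    (1 - alpha) confidence interval. Hypothetical helper, not the
    paper's implementation.
    """
    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement and record the mean of each resample.
    resampled_means = sorted(
        statistics.fmean(rng.choices(values, k=n)) for _ in range(n_resamples)
    )
    lower_idx = int((alpha / 2) * n_resamples)
    upper_idx = int((1 - alpha / 2) * n_resamples) - 1
    return (
        statistics.fmean(values),
        resampled_means[lower_idx],
        resampled_means[upper_idx],
    )


# Synthetic stand-in for 100 per-sample KL estimates (made-up numbers).
_rng = random.Random(1)
kl_estimates = [abs(_rng.gauss(0.5, 0.1)) for _ in range(100)]

mean, lower, upper = bootstrap_mean_ci(kl_estimates)
print(f"mean KL = {mean:.3f}, 95% CI = [{lower:.3f}, {upper:.3f}]")
```

Fixing the `seed` argument makes the interval itself reproducible, in the same spirit as the fixed `AFL_RANDOM_SEED` values listed in the Experiment Setup row.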