Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Constrained Sampling for Language Models Should Be Easy: An MCMC Perspective

Authors: Emmanuel Anaya Gonzalez, Sairam Vaidya, Kanghee Park, Ruyi Ji, Taylor Berg-Kirkpatrick, Loris D'Antoni

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Empirically, our sampler outperforms existing methods on both synthetic benchmarks and real-world program fuzzing tasks. We first evaluate the different sampling methods on four grammar-constrained generation tasks proposed by Park et al. [43] and Lipkin et al. [37]. In Sec. 4.2 we show that MCMC, when compared to alternative sampling techniques, improves the quality of seeds needed to bootstrap fuzzing algorithms.
Researcher Affiliation	Academia	Emmanuel Anaya Gonzalez (1), Sairam Vaidya (1), Kanghee Park (1), Ruyi Ji (2), Taylor Berg-Kirkpatrick (1), Loris D'Antoni (1). (1) UCSD, (2) Peking University.
Pseudocode	Yes	Algorithm 1: The Metropolis-Hastings algorithm instantiated for grammar-aligned sampling. Data: the LM P, the grammar G, a parameter k denoting the chain length, and a configurable distribution pw POS for sampling a random prefix from a given sequence. Result: a proposed sequence s.
Open Source Code	Yes	Code available at https://github.com/large-loris-models/casa
Open Datasets	Yes	We first evaluate the different sampling methods on four grammar-constrained generation tasks proposed by Park et al. [43] and Lipkin et al. [37]. From [43], two of our tasks involve synthesizing expressions in an extension of linear integer arithmetic (SLIA) and loop invariants with bit-vector arithmetic (BV4). Our third task is the constituency parsing (CP) task already used in prior GCD work [19] where the grammar is used to help the model produce well-parenthesized parse trees for English sentences. For our fourth domain, we draw the Molecular Synthesis (MS) task from [37]
Dataset Splits	No	For each individual task, MCMC variant, and number of steps, we obtain 100 samples and use bootstrapping [16] to report mean KL divergence and 95% confidence interval. Second, to evaluate the trade-off between seed quality and quantity, we vary the number of initial seeds (N ∈ {50, 100, 200, 500}) for compute-matched comparisons.
Hardware Specification	Yes	Our experiments were conducted on Ubuntu 22.04 LTS nodes with Intel Xeon Gold 6230 CPUs (2.10 GHz, 10 cores, 20 threads allocated) and 384 GB RAM. For GPU-accelerated workloads, we provisioned 2x NVIDIA RTX A6000 GPUs. SMC+AWRS (k = 10) was evaluated on 1 H100 GPU (120GB) due to higher memory requirements.
Software Dependencies	Yes	Our implementation is based on Python 3.10.12, PyTorch 2.6.0+cu124, AFL++ 4.00c and LLVM 14.0.0.
Experiment Setup	Yes	For language-model decoding, we set temperature to 1.0, top-p to 1.0, and top-k to 0 to allow sampling from the full token vocabulary. We limited the maximum number of newly generated tokens to 512 for XML, and 1024 for SQL .test scripts. Each (benchmark, method) pair was evaluated in N = 5 independent, single-instance AFL++ runs of exactly 21600 s (six hours). We set AFL_RANDOM_SEED to 42 + i (i = 1, ..., 5) for reproducibility and configure standard environment variables to ensure noninteractive execution. All other AFL++ parameters remained at defaults to isolate the impact of seed corpus quality.
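
The Experiment Setup row above can be read as a small run matrix. The following is a minimal sketch of how those values might be laid out, not the authors' harness: the function name `run_configs`, the dictionary keys, and the choice of `AFL_NO_UI` as the noninteractive variable are assumptions made for illustration, since the quoted text does not list the exact environment variables used.

```python
# Values quoted in the Experiment Setup row.
RUN_SECONDS = 21600   # six hours per AFL++ run
NUM_RUNS = 5          # N = 5 independent runs per (benchmark, method) pair
BASE_SEED = 42        # AFL_RANDOM_SEED is 42 + i for i = 1..5

# Decoding settings from the same row; the key names are illustrative.
DECODING = {"temperature": 1.0, "top_p": 1.0, "top_k": 0}
MAX_NEW_TOKENS = {"xml": 512, "sql": 1024}


def run_configs(benchmark, method):
    """Return one configuration dict per independent fuzzing run (hypothetical helper)."""
    configs = []
    for i in range(1, NUM_RUNS + 1):
        env = {
            "AFL_RANDOM_SEED": str(BASE_SEED + i),
            # Assumption: the source only says "standard environment variables".
            "AFL_NO_UI": "1",
        }
        configs.append(
            {
                "benchmark": benchmark,
                "method": method,
                "run_index": i,
                "duration_s": RUN_SECONDS,
                "env": env,
            }
        )
    return configs


if __name__ == "__main__":
    for cfg in run_configs("xml", "mcmc"):
        print(cfg["run_index"], cfg["env"]["AFL_RANDOM_SEED"], cfg["duration_s"])
```

Keeping the seed derivation in one place makes it easy to check that every (benchmark, method) pair is run with the same five seeds, which is what makes the comparison between seed corpora fair.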
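
The Dataset Splits row mentions bootstrapping over 100 samples to report a mean KL divergence with a 95% confidence interval. The quoted text does not say which bootstrap variant was used, so the sketch below assumes a plain percentile bootstrap; the function name `bootstrap_mean_ci` and its defaults (1000 resamples, fixed seed) are illustrative choices, and the input values are placeholders rather than results from the paper.

```python
import random
import statistics


def bootstrap_mean_ci(values, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile-bootstrap estimate of the mean and a (1 - alpha) interval.

    Assumed variant: resample with replacement, take the mean of each
    resample, and read the interval off the sorted resample means.
    """
    rng = random.Random(seed)
    n = len(values)
    means = sorted(
        statistics.fmean(rng.choices(values, k=n)) for _ in range(n_resamples)
    )
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))
    return statistics.fmean(values), means[lo_idx], means[hi_idx]


if __name__ == "__main__":
    # Placeholder per-sample scores, not data from the paper.
    scores = [0.1 * i for i in range(100)]
    mean, lo, hi = bootstrap_mean_ci(scores)
    print(f"mean={mean:.3f}, 95% CI=({lo:.3f}, {hi:.3f})")
```

Fixing the resampling seed keeps the reported interval stable between reruns of the analysis.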