Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Approximately Aligned Decoding

Authors: Daniel Melcer, Sujan Kumar Gonugondla, Pramuditha Perera, Haifeng Qian, Wen-Hao Chiang, Yanjun Wang, Nihal Jain, Pranav Garg, Xiaofei Ma, Anoop Deoras

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We show through a series of experiments that the task-specific performance of AprAD is comparable to methods that do not distort the output distribution, while being much more computationally efficient. We run a series of experiments, demonstrating that our method obtains excellent task-specific performance on both synthetic and real-world domains, without introducing an unreasonable level of inference overhead.
Researcher Affiliation	Collaboration	Daniel Melcer (Khoury College of Computer Sciences, Northeastern University, Boston, MA, USA); Sujan Gonugondla (Meta Superintelligence Labs, New York, NY, USA); Pramuditha Perera (AWS NGDE, New York, NY, USA); Haifeng Qian (Nvidia, Santa Clara, CA, USA); Wen-Hao Chiang (AWS NGDE, New York, NY, USA); Yanjun Wang (AWS NGDE, New York, NY, USA); Nihal Jain (AWS NGDE, New York, NY, USA); Pranav Garg (AWS NGDE, New York, NY, USA); Xiaofei Ma (AWS NGDE, New York, NY, USA); Anoop Deoras (AWS NGDE, New York, NY, USA)
Pseudocode	Yes	Algorithm 1: Generation with an autoregressive model. procedure Generate(P, x_{1...n}) [initial x_{1...n} is the prompt]; while stopping condition not met do [typically a special EOS token and a length limit]: sample one token x_{n+1} ~ P(· | x); increment n; return x
Open Source Code	Yes	Work performed while at Amazon. Code available at https://github.com/amazon-science/Approximately-Aligned-Decoding.
Open Datasets	Yes	We use Mistral-7B-Instruct-v0.2 [11] to generate text, where generation of a given vowel is considered an error. We evaluate the effectiveness of each error-free sampling method on a code generation task, where the generator avoids API hallucinations. [...] The methods are compared based on their performance on BigCodeBench v0.1 [36], a benchmark that focuses on practical programming tasks, often requiring the use of common libraries.
Dataset Splits	Yes	We provide the following prompts to the language model, as well as the relevant special tokens to delimit user instructions and chat turns. 1. Write a story without using the letter [A/E/I/O/U]. 2. Describe elephants without using the letter [A/E/I/O/U]. 3. Provide instructions to tie a tie without using the letter [A/E/I/O/U]. 4. Critique the Mona Lisa without using the letter [A/E/I/O/U]. 5. Summarize the history of artificial intelligence without using the letter [A/E/I/O/U]. Each prompt is combined with each vowel, resulting in 25 prompts.
Hardware Specification	Yes	This took about a day or two on an AWS p4d.24xlarge instance.
Software Dependencies	No	The paper mentions using Mistral-7B-Instruct-v0.2 [11], Pyright language server [20], and Starcoder2 [17], but does not provide specific version numbers for ancillary software like Python, PyTorch, or the Pyright server itself. Pyright is cited to a GitHub repository, but no explicit version number is stated in the text.
Experiment Setup	Yes	During sampling, we use a top-k of 20, and temperature of 0.8. 200 tokens was chosen as short enough to be quickly read by the human raters, and long enough to discern the sample quality. 2000 tokens was chosen as 10 times the output length, to prevent infinite computation. ... For all sampling methods, we use Starcoder2 [17], in the 7B and 15B model sizes. We generate 5 samples for each task, with temperature 0.8, and a top-p of 0.95.
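
Illustration (not taken from the paper's repository): the plain autoregressive loop quoted in the Pseudocode row, combined with the top-k of 20 and temperature of 0.8 quoted in the Experiment Setup row, can be sketched roughly as follows. The function names, the `toy_next_logits` stand-in model, its five-token vocabulary, and the choice to stop right after appending the EOS token are all assumptions made for this example; this is the unconstrained baseline loop, not the AprAD method itself.

```python
import math
import random


def sample_top_k(logits, k=20, temperature=0.8, rng=random):
    """Sample one token id from the k highest logits after temperature scaling."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    scaled = [logits[i] / temperature for i in top]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(top, weights=weights, k=1)[0]


def generate(next_logits, prompt, eos_id, max_new_tokens=200,
             k=20, temperature=0.8, rng=random):
    """Append sampled tokens to the prompt until EOS or the length limit."""
    x = list(prompt)
    for _ in range(max_new_tokens):
        token = sample_top_k(next_logits(x), k=k, temperature=temperature, rng=rng)
        x.append(token)
        if token == eos_id:  # assumption: stop once EOS has been emitted
            break
    return x


def toy_next_logits(x):
    """Hypothetical stand-in for a language model over a 5-token vocabulary."""
    vocab_size = 5
    favored = (x[-1] + 1) % vocab_size  # prefer the "next" token id
    return [2.0 if i == favored else 0.0 for i in range(vocab_size)]


if __name__ == "__main__":
    tokens = generate(toy_next_logits, prompt=[0], eos_id=4, rng=random.Random(0))
    print(tokens)
```

With a real model, `next_logits` would wrap a forward pass that returns the logits for the next position; the sampling and stopping logic would stay the same.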