Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Corrector Sampling in Language Models

Authors: Itai Gat, Neta Shaul, Uriel Singer, Yaron Lipman

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We fine-tuned a pretrained 8B AR model for 10% of its final training iterations and compared it to the fully-trained pretrained model to find RPT pretraining and sampling provides 5%-10% relative improvements in common benchmarks. Our contributions include: (3) Demonstrating empirically that RPT outperforms standard NTP sampling in both reasoning and coding tasks, as well as in a controlled error analysis. We evaluate our method on popular coding and reasoning benchmarks: HumanEval+ (Chen et al., 2021; Liu et al., 2023), MBPP (Austin et al., 2021), GSM8K (Cobbe et al., 2021)... Table 1 summarizes the results of our experiments, comparing the performance of RPT sampling to NTP on several baselines.
Researcher Affiliation	Academia	The paper does not provide explicit institutional affiliations or email domains for the authors. The authors are listed as: Itai Gat, Neta Shaul, Uriel Singer, Yaron Lipman. Without this information, it is not possible to classify the affiliation type.
Pseudocode	Yes	Algorithm 1: Resample-Previous-Tokens (RPT) training. 1: Input: dataset D; pretrained or initialized model f̂_θ with params θ_0 2: Hyperparameters: probabilities s, q ∈ (0, 1); window size w ≥ 2; number of iterations m 3: θ ← θ_0 ▷ Initialize parameters 4: for iteration i = 1 to m do 5: Draw x ∼ D ▷ Draw a training sample 6: Set σ = (1, 2, ..., n − 1) ▷ The identity permutation 7: Set τ = (2, 3, ..., n) ▷ Next token index 8: With probability s permute σ using q and w ▷ See training in section 2.1 9: Compute τ ▷ Use equation 10 10: X = (x_σ, σ, τ) ▷ Set the input to the network 11: Y = x_τ ▷ Set the target 12: L ← L_CE(f̂_θ(X), Y) ▷ Evaluate cross-entropy loss, equation 13 13: θ ← optimize(L) ▷ Update θ with optimization step 14: end for
Open Source Code	No	Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [No] Justification: Implementation is straightforward and general to AR LLMs.
Open Datasets	Yes	We evaluate our method on popular coding and reasoning benchmarks: HumanEval+ (Chen et al., 2021; Liu et al., 2023) is a task where the model is required to complete a given function signature with a docstring. MBPP (Austin et al., 2021) contains few-shot code generation tasks from problem descriptions. GSM8K (Cobbe et al., 2021) consists of grade-school-level mathematical word problems. Finally, we report results on MultiPL-E, a non-Pythonic version of HumanEval (Ben Allal et al., 2022).
Dataset Splits	Yes	Our data consists of a corpus of one trillion (1T) language tokens. Throughout all experiments, we use the same training dataset and maintain the same data order. We pretrained an autoregressive model with 8B parameters, using the standard cross entropy loss (i.e., equation 13 with σ_i = i and τ_i = i + 1) and the same architectural design as in Meta (2024) on this 1T token data for 240K iterations as our baseline, denoted AR-F. We denote its 224K iteration checkpoint (i.e., after 90% of the training tokens) by AR-C. We next finetuned AR-C to reach 240K iterations and the remaining 100B tokens (10% of total training tokens)... In Table 2 we report empirical TV distances computed with 128K validation set tokens from each dataset, all of them with context of at least 20 tokens...
Hardware Specification	Yes	We train on 256 H100 GPUs and a batch size of 4M tokens.
Software Dependencies	No	The paper mentions using the "AdamW optimizer" but does not specify version numbers for any software libraries (e.g., Python, PyTorch, TensorFlow) or specific versions of CUDA.
Experiment Setup	Yes	We next finetuned AR-C to reach 240K iterations and the remaining 100B tokens (10% of total training tokens) with m = 16K iterations in Algorithm 1 with window size w = 3, and hyper-parameters s = 0.5 and q = 0.02, which corresponds to 80 expected swaps in each sequence of n = 4096 tokens (equation 9). We train on 256 H100 GPUs and a batch size of 4M tokens. We use the AdamW optimizer with a warmup of 2000 steps, a peak learning rate of 1e-3 and a cosine scheduler.
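
The optimizer settings in the Experiment Setup row can be illustrated with a small Python sketch. It is not the authors' code: the function names lr_at and expected_swaps are hypothetical, and the linear warmup shape, the decay to a minimum learning rate of zero, and the decay horizon of m = 16K steps are all assumptions, since the row only states "a warmup of 2000 steps, a peak learning rate of 1e-3 and a cosine scheduler". The swap estimate uses the rough product q × n rather than the paper's equation 9, which is not reproduced on this page.

```python
import math

# Values quoted in the Experiment Setup row.
PEAK_LR = 1e-3
WARMUP_STEPS = 2000
TOTAL_STEPS = 16_000  # m = 16K fine-tuning iterations
SEQ_LEN = 4096        # n
Q = 0.02              # q


def lr_at(step, peak_lr=PEAK_LR, warmup=WARMUP_STEPS,
          total=TOTAL_STEPS, min_lr=0.0):
    """Learning rate at a 0-indexed step.

    Assumptions (not stated in the source): warmup is linear,
    and the cosine phase decays to min_lr over the remaining steps.
    """
    if step < warmup:
        # Linear ramp that reaches peak_lr on the last warmup step.
        return peak_lr * (step + 1) / warmup
    # Fraction of the cosine phase completed, clamped to [0, 1].
    progress = min(1.0, (step - warmup) / max(1, total - warmup))
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


def expected_swaps(q=Q, n=SEQ_LEN):
    """Rough estimate q * n; an approximation, not the paper's equation 9."""
    return q * n


if __name__ == "__main__":
    for s in (0, 1999, 2000, 9000, 16_000):
        print(s, lr_at(s))
    print("approx. expected swaps per sequence:", expected_swaps())
```

Under these assumptions the rough q × n estimate comes out at about 82 swaps per 4096-token sequence, which is in the same range as the 80 quoted in the row.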