Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

AutoJudge: Judge Decoding Without Manual Annotation

Authors: Roman Garipov, Fedor Velikonivtsev, Ivan Ermakov, Ruslan Svirschevski, Vage Egiazarian, Max Ryabinin

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We evaluate AutoJudge with multiple draft/target model pairs on mathematical reasoning and programming benchmarks, achieving significant speedups at the cost of a minor accuracy reduction. Notably, on GSM8K with the Llama 3.1 70B target model, our approach achieves up to 2× speedup over speculative decoding at the cost of a 1% drop in accuracy. When applied to the LiveCodeBench benchmark, AutoJudge automatically detects programming-specific important tokens, accepting 25 tokens per speculation cycle at a 2% drop in Pass@1. Our experiments with Llama 3.x models demonstrate that the proposed approach can indeed identify important tokens and save time on speculation.
Researcher Affiliation	Collaboration	Roman Garipov (HSE University, Yandex); Fedor Velikonivtsev (HSE University, Yandex); Ivan Ermakov (HSE University, Yandex); Ruslan Svirschevski (Yandex); Vage Egiazarian (IST Austria); Max Ryabinin (Together AI)
Pseudocode	Yes	Algorithm 1 SEARCH FOR IMPORTANT TOKENS 1: Input: x: prompt, θdraft: draft model, θtarget: target model 2: Output: a sequence of M mismatches, labeled as important or unimportant
Open Source Code	Yes	Our code is available at github.com/garipovroma/autojudge.
Open Datasets	Yes	We evaluate AutoJudge with multiple draft/target model pairs on mathematical reasoning and programming benchmarks. Notably, on GSM8K with the Llama 3.1 70B target model, our approach achieves up to 2× speedup over speculative decoding at the cost of a 1% drop in accuracy. When applied to the LiveCodeBench benchmark, AutoJudge automatically detects programming-specific important tokens, accepting 25 tokens per speculation cycle at a 2% drop in Pass@1.
Dataset Splits	Yes	Our first set of experiments is based on the GSM8K dataset with grade school mathematical problems. This dataset has a natural split with 7.47K training samples and 1.32K test samples. Following the standard evaluation procedure, we use the training set to mine important tokens with Algorithm 1 and train the classifier, then run inference and evaluate on the test set with the recommended parameters [Gao et al., 2021] for zero-shot and 8-shot evaluation: greedy inference with a prompt that encourages chain-of-thought reasoning. ... Since LiveCodeBench does not have a dedicated training split, we evaluate using out-of-fold predictions. Namely, we split the dataset randomly into 5 folds. For each fold, we evaluate using the classifier trained on the 4 remaining folds. ... To compare different classifier configurations, we further divide the GSM8K training set into classifier training (90%) and validation (10%) subsets.
Hardware Specification	Yes	We integrate AutoJudge with the vLLM framework [Kwon et al., 2023] and report the inference speed on A100 GPUs for both 8B and 70B target models, with up to 2× speedup over speculative decoding at a 1% quality decrease, and on H100 GPUs with a 405B target model. ... We run 1B/8B model pair on a single A100-SXM4-80GB GPU; 8B/70B on 4 A100-SXM4-80G GPUs in tensor-parallel mode. Finally, the 8B/405B runs on 8 H100-SXM5-80GB GPUs with the 405B model loaded in FP8 precision. ... We run our experiments primarily on A100-SXM4 GPUs with 80GB DRAM on servers with dual Epyc 7742 CPU and 1TiB RAM.
Software Dependencies	Yes	The integration is built upon vllm==0.8.5, torch==2.7.0 with CUDA 12.8 and transformers==4.51.3. ... For consistency, we run all models using scikit-learn [Pedregosa et al., 2011] v1.4.2 with all other settings kept to their default values.
Experiment Setup	Yes	We use a window size of 32 and a batch size of 1. ... greedy inference with a prompt that encourages chain-of-thought reasoning. During training, we consider two responses equivalent (a and â in Algorithm 1) if the extracted final answers (numbers) are equal. ... We train a classifier on the last hidden state embeddings from both draft and target models (concatenated) for encoded draft tokens. ... We train a logistic regression with the L2 regularization coefficient (C) with a logarithmic grid. We report additional details in Appendix B. ... we use the increased draft window size of W=64 tokens for all evaluations. ... We select a decision threshold that achieves a high recall (90%) in order to retain quality.
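
The Dataset Splits and Experiment Setup rows quote three mechanical steps: a random 5-fold split for out-of-fold evaluation, a logarithmic grid for the regularization coefficient C, and a decision threshold chosen to reach a target recall. The following is a minimal standard-library sketch of those steps, not the authors' implementation (the quoted text says the paper uses scikit-learn for the classifier itself, which is not reproduced here). The function names, the seed, and the grid bounds are hypothetical, and treating "important" tokens as the positive class is an assumption.

```python
import math
import random


def kfold_indices(n_samples, n_folds=5, seed=0):
    """Randomly split range(n_samples) into folds.

    Returns one (train_indices, test_indices) pair per fold, so every
    sample is predicted by a model that never saw it (out-of-fold).
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # seed is an illustrative choice
    folds = [indices[i::n_folds] for i in range(n_folds)]
    splits = []
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train, test))
    return splits


def log_grid(low_exp, high_exp, num):
    """Return `num` values evenly spaced in log10 from 10**low_exp to 10**high_exp."""
    if num == 1:
        return [10.0 ** low_exp]
    step = (high_exp - low_exp) / (num - 1)
    return [10.0 ** (low_exp + i * step) for i in range(num)]


def threshold_for_recall(scores, labels, target_recall=0.9):
    """Largest threshold t such that `score >= t` reaches the target recall.

    Assumes label 1 marks the positive class (here: an important token).
    """
    positives = sorted((s for s, y in zip(scores, labels) if y == 1), reverse=True)
    if not positives:
        raise ValueError("need at least one positive example")
    needed = max(1, math.ceil(target_recall * len(positives)))
    return positives[needed - 1]


def recall_at(scores, labels, threshold):
    """Fraction of positive examples with score >= threshold."""
    total = sum(1 for y in labels if y == 1)
    hits = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= threshold)
    return hits / total


if __name__ == "__main__":
    for train, test in kfold_indices(23, n_folds=5):
        print(len(train), len(test))
    print(log_grid(-3, 3, 7))  # hypothetical bounds for C
```

In this sketch a classifier would be fit once per fold on `train` and scored on `test`; the per-fold scores, concatenated, give the out-of-fold predictions on which a threshold could then be chosen with `threshold_for_recall`.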