Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Language Ranker: A Lightweight Ranking framework for LLM Decoding

Authors: Chenheng Zhang, Tianqi Du, Jizhe Zhang, Mingqing Xiao, Yifei Wang, Yisen Wang, Zhouchen Lin

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Experiments across a wide range of tasks show that Language Ranker achieves performance comparable to large-scale reward models, while requiring only <0.5M additional parameters, significantly reducing the computational overhead during both training and inference stages. In this section, we conduct experiments on three representative LLM tasks: mathematics, coding, and function calling. We further perform detailed analyses and ablation studies, as well as evaluate the transferability of our method.
Researcher Affiliation	Collaboration	(1) State Key Lab of General Artificial Intelligence, School of Intelligence Science and Technology, Peking University; (2) Institute for Artificial Intelligence, Peking University; (3) MIT CSAIL, MA, USA; (4) Microsoft Research Asia
Pseudocode	No	The paper describes the framework and ranker designs using text and figures, but no explicit pseudocode or algorithm blocks are provided.
Open Source Code	No	Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [No] Justification: We promise to release the code after publication.
Open Datasets	Yes	For the mathematics task, we use the MATH dataset [20], which contains 12,500 competition-level problems spanning seven topics and five difficulty levels. For the coding task, we use the complete MBPP dataset [21], which consists of short Python programming problems, 374 for training and 500 for testing, each paired with test cases to evaluate the correctness of the generated solutions. For the function calling task, we adopt the xlam-function-calling-60k dataset [22], which comprises 60,000 high-quality function calling problems and answers. We use the first 1,000 queries from the Databricks-Dolly-15k dataset [43] for training. For evaluation, we adopt Alpaca Eval [44], a widely recognized benchmark for assessing instruction-following capabilities in LLMs.
Dataset Splits	Yes	For the mathematics task, we uniformly sample 1,000 problems each for training and testing across different topics and difficulty levels. For the coding task, we use the complete MBPP dataset [21], which consists of short Python programming problems, 374 for training and 500 for testing... For the function calling task, we adopt the xlam-function-calling-60k dataset [22]... We randomly sample 1,500 more challenging problems with more than three APIs, and split them into 1,000 training and 500 testing examples.
Hardware Specification	Yes	Table 3: The total training time on the MBPP dataset for both CPU and GPU settings, including data loading stages. Method CPU A100
Software Dependencies	No	The paper mentions using Python for coding tasks and optimizers like SGD and AdamW (Table 9), but does not provide specific version numbers for these or other key software components or libraries.
Experiment Setup	Yes	Ranker Settings: In all experiments, the rankers are implemented using either a single Transformer block or a single MLP block, and they operate on features extracted from approximately the bottom 60% of the base model's layers. During both training and evaluation, each data group consists of 10 candidate responses. The ranker is trained to classify each response as correct or incorrect, formulating the task as a binary classification problem. Cosine similarity is used to compute the final logits, and the training objective is defined by the classification loss J_cls, as specified in Equations 8 or 11. Appendix A, Table 9: The hyperparameter list provides specific values for Sampling Temperature (1.5), Sampling Max New Tokens (1024), Ranker Training Batch Size ([256, 1024]), Epoch (1), Optimizer ([SGD, AdamW]), Learning Rates, and Projection Dimension (64).
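
The Experiment Setup row describes scoring a group of 10 candidate responses with cosine-similarity logits and a binary classification loss. The following is a minimal sketch of that idea only, not the authors' implementation (the paper's code was not released at the time of indexing). Several details are assumptions made for illustration: the logit is taken as the cosine similarity between a projected query feature and each projected candidate feature, a single shared linear projection `W` stands in for the Transformer or MLP block, the `scale` factor is hypothetical, and binary cross-entropy stands in for the loss J_cls of Equations 8 and 11, whose exact form is not reproduced here. Only the projection dimension (64) and group size (10) come from the row above.

```python
import math
import random

PROJ_DIM = 64     # projection dimension reported in Table 9
GROUP_SIZE = 10   # candidate responses per data group


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-8)


def project(W, x):
    """Apply a linear projection W (list of rows) to feature vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def score_group(W, query_feat, cand_feats, scale=10.0):
    """Return one logit per candidate.

    Assumption: logit = scale * cos(proj(query), proj(candidate)).
    `scale` is a hypothetical temperature-like factor, not from the paper.
    """
    q = project(W, query_feat)
    return [scale * cosine(q, project(W, c)) for c in cand_feats]


def bce_loss(logits, labels):
    """Mean binary cross-entropy on logits (numerically stable form).

    Assumed stand-in for the paper's classification loss J_cls.
    """
    total = 0.0
    for z, y in zip(logits, labels):
        total += max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
    return total / len(logits)


def pick_best(logits):
    """Index of the highest-scoring candidate, used at inference time."""
    return max(range(len(logits)), key=logits.__getitem__)


# Toy usage with random features; HIDDEN_DIM is an arbitrary illustrative size.
rng = random.Random(0)
HIDDEN_DIM = 32
W = [[rng.gauss(0, 1) for _ in range(HIDDEN_DIM)] for _ in range(PROJ_DIM)]
query = [rng.gauss(0, 1) for _ in range(HIDDEN_DIM)]
candidates = [[rng.gauss(0, 1) for _ in range(HIDDEN_DIM)] for _ in range(GROUP_SIZE)]
labels = [rng.randint(0, 1) for _ in range(GROUP_SIZE)]  # 1 = correct response

logits = score_group(W, query, candidates)
loss = bce_loss(logits, labels)
best = pick_best(logits)
```

In this sketch training would minimize `bce_loss` over groups, and decoding would return the candidate at index `best`; because cosine similarity is bounded in [-1, 1], every logit stays within `[-scale, scale]`.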