Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Quantifying Elicitation of Latent Capabilities in Language Models

Authors: Elizabeth Donoway, Hailey Joren, Arushi Somani, Henry Sleight, Julian Michael, Michael R. Deweese, John Schulman, Ethan Perez, Fabien Roger, Jan Leike

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	In this work, we recast elicitation as an information-constrained fine-tuning problem and empirically characterize upper bounds on the minimal number of parameters needed to achieve specific task performances.
Researcher Affiliation	Collaboration	Anthropic; University of California, Berkeley; Constellation; Thinking Machines; Scale AI
Pseudocode	Yes	Algorithm 1 Prequential MDL Computation.
Open Source Code	Yes	Code repository: https://github.com/edonoway/quantifying-elicitation-neurips25
Open Datasets	Yes	We fine-tune and evaluate on 4 classification tasks and 4 generative tasks: GSM-8K-CoT-Choice (a CoT correctness classification task, see Section D.1 for dataset details), ARC-Easy, ARC-Challenge [24], and BoolQ [25] for classification tasks, and Alpaca [26], TinyStories [27], Lichess chess puzzles, and s1K-Qwen-1.5B for generation tasks.
Dataset Splits	Yes	The raw 52k examples are shuffled once with a fixed random seed and partitioned 90/10 into training (234,006 instructions) and held-out evaluation (26,006 instructions). No additional filtering or augmentation is applied.
Hardware Specification	Yes	Hardware: Single NVIDIA H100 80 GB GPU
Software Dependencies	No	The paper mentions using "AdamW as the optimizer" and "FlashAttention-2" but does not specify version numbers for these or other software libraries or programming languages used.
Experiment Setup	Yes	Learning rate: [10^-6, 1] (log-uniform spacing); Batch size: {1, 2, 4, 8, 16, 32, 64, 128, 256} (discrete); Weight decay: [0, 0.1] (uniform sampling)
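
The Dataset Splits and Experiment Setup rows describe a seeded 90/10 shuffle-and-partition and a three-parameter search space. The following is a minimal Python sketch of both, not the authors' code. The function names `split_90_10` and `sample_config` and the seed values are hypothetical. The learning-rate lower bound is read here as 10^-6, and "log-uniform spacing" is interpreted as a random log-uniform draw, which is an assumption; a log-spaced grid would be equally consistent with the row.

```python
import random

# Discrete batch sizes listed in the Experiment Setup row.
BATCH_SIZES = [1, 2, 4, 8, 16, 32, 64, 128, 256]


def split_90_10(examples, seed=0):
    """Shuffle once with a fixed seed, then partition 90/10 (seed value is assumed)."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    cut = (len(shuffled) * 9) // 10  # integer arithmetic avoids float rounding
    return shuffled[:cut], shuffled[cut:]


def sample_config(rng):
    """Draw one configuration from the reported ranges (sampling scheme assumed)."""
    return {
        # Log-uniform on [1e-6, 1]: draw the base-10 exponent uniformly from [-6, 0].
        "learning_rate": 10 ** rng.uniform(-6.0, 0.0),
        "batch_size": rng.choice(BATCH_SIZES),
        "weight_decay": rng.uniform(0.0, 0.1),
    }


if __name__ == "__main__":
    train, held_out = split_90_10(range(1000), seed=0)
    print(len(train), len(held_out))
    rng = random.Random(0)
    for _ in range(3):
        print(sample_config(rng))
```

Drawing the exponent uniformly, rather than the learning rate itself, gives each decade between 10^-6 and 1 equal probability, which is the usual meaning of a log-uniform range.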