Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026.
Quantifying Elicitation of Latent Capabilities in Language Models
Authors: Elizabeth Donoway, Hailey Joren, Arushi Somani, Henry Sleight, Julian Michael, Michael R. Deweese, John Schulman, Ethan Perez, Fabien Roger, Jan Leike
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this work, we recast elicitation as an information-constrained fine-tuning problem and empirically characterize upper bounds on the minimal number of parameters needed to achieve specific task performances. |
| Researcher Affiliation | Collaboration | (1) Anthropic, (2) University of California, Berkeley, (3) Constellation, (4) Thinking Machines, (5) Scale AI |
| Pseudocode | Yes | Algorithm 1 Prequential MDL Computation. |
| Open Source Code | Yes | Code repository: https://github.com/edonoway/quantifying-elicitation-neurips25 |
| Open Datasets | Yes | We fine-tune and evaluate on 4 classification tasks and 4 generative tasks: GSM-8K-CoT-Choice (a CoT correctness classification task, see Section D.1 for dataset details), ARC-Easy, ARC-Challenge [24], and BoolQ [25] for classification tasks, and Alpaca [26], TinyStories [27], Lichess chess puzzles, and s1K-Qwen-1.5B3 for generation tasks. |
| Dataset Splits | Yes | The raw 52k examples are shuffled once with a fixed random seed and partitioned 90/10 into training (234,006 instructions) and held-out evaluation (26,006 instructions). No additional filtering or augmentation is applied. |
| Hardware Specification | Yes | Hardware: Single NVIDIA H100 80 GB GPU |
| Software Dependencies | No | The paper mentions using "AdamW as the optimizer" and "FlashAttention-2" but does not specify version numbers for these or for any other software libraries or programming languages used. |
| Experiment Setup | Yes | Learning rate: [10^-6, 1] (log-uniform spacing); Batch size: {1, 2, 4, 8, 16, 32, 64, 128, 256} (discrete); Weight decay: [0, 0.1] (uniform sampling) |
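
The search ranges quoted in the Experiment Setup row can be illustrated with a short sampling sketch. This is not the authors' code: the function name `sample_config`, the use of random search, and the uniform choice over batch sizes are assumptions made for illustration, and the lower learning-rate bound is read as 10^-6 because the exponent sign appears to have been lost in extraction.

```python
import random

# Search space as quoted in the Experiment Setup row.
# ASSUMPTION: the lower learning-rate bound is 10^-6 (exponent sign lost in extraction).
LR_LOG10_MIN = -6.0   # learning rate lower bound, as a base-10 exponent
LR_LOG10_MAX = 0.0    # learning rate upper bound of 1
BATCH_SIZES = [1, 2, 4, 8, 16, 32, 64, 128, 256]
WEIGHT_DECAY_MIN = 0.0
WEIGHT_DECAY_MAX = 0.1


def sample_config(rng: random.Random) -> dict:
    """Draw one hypothetical hyperparameter configuration.

    ASSUMPTION: random search with a uniform choice over batch sizes;
    the source only states the ranges and their spacing.
    """
    # Log-uniform: sample the exponent uniformly, then exponentiate.
    log10_lr = rng.uniform(LR_LOG10_MIN, LR_LOG10_MAX)
    return {
        "learning_rate": 10.0 ** log10_lr,
        "batch_size": rng.choice(BATCH_SIZES),
        "weight_decay": rng.uniform(WEIGHT_DECAY_MIN, WEIGHT_DECAY_MAX),
    }


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the draws are repeatable
    for _ in range(3):
        print(sample_config(rng))
```

Sampling the exponent rather than the learning rate itself gives each decade between the two bounds equal probability, which is what log-uniform spacing means in practice.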