Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026.
Experts Don’t Cheat: Learning What You Don’t Know By Predicting Pairs
Authors: Daniel D. Johnson, Daniel Tarlow, David Duvenaud, Chris J. Maddison
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate empirically that our approach accurately estimates how much models don't know across ambiguous image classification, (synthetic) language modeling, and partially-observable navigation tasks, outperforming existing techniques. |
| Researcher Affiliation | Collaboration | ¹Google DeepMind ²University of Toronto, Department of Computer Science, Ontario, Canada. Correspondence to: Daniel D. Johnson <EMAIL>. |
| Pseudocode | Yes | Algorithm 1 Conservative adjustment of $\hat{V}^{\theta}$ |
| Open Source Code | No | The paper does not provide a direct link or explicit statement about the release of its own source code for the methodology described. |
| Open Datasets | Yes | We demonstrate our technique on CIFAR-10H (Peterson et al., 2019), a relabeling of the CIFAR-10 test set (Krizhevsky, 2009) by > 50 independent annotators per image. |
| Dataset Splits | Yes | We use the next 2,000 images in CIFAR-10H as our validation set. |
| Hardware Specification | No | No specific hardware details for the experiments are mentioned beyond general acknowledgements of computing resources. |
| Software Dependencies | No | The paper mentions software like TensorFlow, Keras, JAX, and Optax, but does not provide specific version numbers for these ancillary software components. |
| Experiment Setup | Yes | We train each method using the AdamW optimizer (Loshchilov & Hutter, 2017) with batch size 512. We divide our training and hyperparameter tuning into the following phases: ... We perform a random search over learning rate and weight decay strength with 250 trials: we choose learning rate logarithmically spaced between $10^{-5}$ and $5 \times 10^{-3}$, and we either sample weight decay uniformly between 0.05 and 0.5, or logarithmically between $10^{-6}$ and 0.05... We use a linear warmup for the learning rate during the first epoch, then use cosine weight decay. |
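
The random search quoted in the Experiment Setup row can be sketched as follows. This is an illustrative sketch only, not the authors' code: the function names `sample_hyperparameters` and `learning_rate_at` are hypothetical, the 50/50 choice between the two weight decay regimes is an assumption (the quote does not state the mixing probability), "logarithmically spaced" is read as log-uniform sampling, and "cosine weight decay" is read as a cosine decay of the learning rate after warmup.

```python
import math
import random


def sample_hyperparameters(rng):
    """Draw one random-search trial (hypothetical helper, not from the paper)."""
    # Learning rate: log-uniform between 1e-5 and 5e-3.
    learning_rate = math.exp(rng.uniform(math.log(1e-5), math.log(5e-3)))
    # Weight decay: either uniform in [0.05, 0.5] or log-uniform in [1e-6, 0.05].
    # ASSUMPTION: the two regimes are chosen with equal probability.
    if rng.random() < 0.5:
        weight_decay = rng.uniform(0.05, 0.5)
    else:
        weight_decay = math.exp(rng.uniform(math.log(1e-6), math.log(0.05)))
    return {"learning_rate": learning_rate, "weight_decay": weight_decay}


def learning_rate_at(step, peak_lr, warmup_steps, total_steps):
    """Linear warmup to peak_lr, then cosine decay to zero (assumed schedule)."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))


# 250 trials, matching the trial count quoted in the table.
rng = random.Random(0)
trials = [sample_hyperparameters(rng) for _ in range(250)]
```

In practice each sampled configuration would be passed to an AdamW optimizer with batch size 512, with `warmup_steps` set to the number of steps in the first epoch; those wiring details are not given in the quoted text and are left out here.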