Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
General Uncertainty Estimation with Delta Variances
Authors: Simon Schmitt, John Shawe-Taylor, Hado van Hasselt
AAAI 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To empirically study the Delta Variance we build on the state-of-the-art GraphCast weather forecasting system (Lam et al. 2023)... We assess the Epistemic Variance predictions on 5 years of hold-out data using multiple metrics such as the correlation between predicted variance and prediction error and the likelihood of the quantities of interest. Empirically, Delta Variances with a diagonal Fisher approximation yield competitive results at lower computational cost; see Figure 3. |
| Researcher Affiliation | Collaboration | 1 DeepMind 2 University College London, UK |
| Pseudocode | No | The paper describes methods and derivations in paragraph form and mathematical equations. It does not include any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not contain an explicit statement about releasing source code for the described methodology, nor does it provide a link to a code repository. |
| Open Datasets | No | We build on the state-of-the-art GraphCast weather prediction system... Training data ranges from 1979-2013 with validation data from 2014-2017 and holdout data from 2018-2021, resulting in about 100 GB of weather data. While the paper cites the GraphCast system (Lam et al. 2023), it does not provide explicit access information (link, DOI, specific repository) for the *specific data* used in their experiments. |
| Dataset Splits | Yes | Training data ranges from 1979-2013 with validation data from 2014-2017 and holdout data from 2018-2021, resulting in about 100 GB of weather data. |
| Hardware Specification | No | The paper does not specify any particular hardware (GPU, CPU, TPU models) used for training or inference. It only mentions, 'To save resources we retrain the model for a grid size of 4 degrees and reduce the number of layers and latents each by a factor of 2.' |
| Software Dependencies | No | The paper mentions the use of 'any auto-differentiation framework' and the 'GraphCast weather forecasting system (Lam et al. 2023)' but does not provide specific version numbers for any software dependencies or libraries used in their implementation. |
| Experiment Setup | Yes | To save resources we retrain the model for a grid size of 4 degrees and reduce the number of layers and latents each by a factor of 2. Finally we skip the fine-tuning curriculum for simplicity. In our experiments we optimize the coefficients of this linear combination using gradient descent to improve the log-likelihood or correlation on a small set of held-out validation data. |
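The "diagonal Fisher" variant referenced in the Research Type row can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the shapes, the toy random gradients, and the function names `diagonal_fisher` and `delta_variance` are assumptions for illustration. It shows the delta-method idea of estimating the epistemic variance of a scalar quantity of interest as g^T F^{-1} g, with the Fisher information F approximated by its diagonal (mean of squared per-example log-likelihood gradients).

```python
import numpy as np

def diagonal_fisher(per_example_grads):
    # Diagonal Fisher approximation: mean of squared per-example
    # log-likelihood gradients (one row per training example).
    return np.mean(per_example_grads ** 2, axis=0)

def delta_variance(g, fisher_diag, eps=1e-8):
    # Delta-method epistemic variance of a scalar quantity of interest:
    # Var ~ g^T F^{-1} g, with F approximated by its diagonal.
    # eps guards against division by near-zero Fisher entries.
    return float(np.sum(g ** 2 / (fisher_diag + eps)))

# Toy usage: random gradients stand in for a real model's gradients.
rng = np.random.default_rng(0)
per_example_grads = rng.normal(size=(100, 10))  # 100 examples, 10 params
g = rng.normal(size=10)  # gradient of the prediction w.r.t. the params
var = delta_variance(g, diagonal_fisher(per_example_grads))
```

In a real auto-differentiation framework the per-example gradients and `g` would come from the trained network; the diagonal approximation is what keeps the cost low relative to a full Fisher matrix.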