Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.

Causal Climate Emulation with Bayesian Filtering

Authors: Sebastian H. M. Hickman, Ilija Trajković, Julia Kaltenborn, Francis Pelletier, Alex Archibald, Yaniv Gurwicz, Peer Nowack, David Rolnick, Julien Boussard

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We demonstrate that our emulator learns accurate climate dynamics, and we show the importance of each one of its components on a realistic synthetic dataset and data from two widely deployed climate models. We demonstrate the performance of our model on a synthetic dataset that mimics atmospheric dynamics and on a dataset from a widely deployed climate model, and we perform ablation studies to show the importance of each component of the model.
Researcher Affiliation	Collaboration	Sebastian Hickman1,4, Ilija Trajkovic2, Julia Kaltenborn3,4, Francis Pelletier4, Alex Archibald1, Yaniv Gurwicz5, Peer Nowack2, David Rolnick3,4, Julien Boussard3,4. 1Yusuf Hamied Department of Chemistry, University of Cambridge, UK; 2Institute of Theoretical Informatics, Karlsruhe Institute of Technology, Germany; 3School of Computer Science, McGill University, Canada; 4Mila Quebec AI Institute, Canada; 5Intel Labs, Israel. Now at the European Centre for Medium-Range Weather Forecasts, UK.
Pseudocode	Yes	Algorithm 1: Autoregressive rollout with Bayesian filtering. Input: observations $x^{\le T}$, trained encoder $p(z^t \mid x^t)$, trained decoder $x = f(z)$, learned transition model $p(z^t \mid z^{<t})$, ground-truth spatial spectrum $\tilde{x}$ with standard deviation $\sigma$, number of sampled trajectories $N$ and sample size $R$, prediction time range $m$. Initialization: draw $N$ samples $z^{\le T}_n \sim p(z^{\le T} \mid x^{\le T})$. For $i = 1$ to $m$: (1) for each $n \le N$, draw $R$ samples $z^{T+i}_{n,r} \sim p(z^{T+i} \mid z^{<T+i})$; (2) for each $n \le N$, $r \le R$, decode $x^{T+i}_{n,r} = f(z^{T+i}_{n,r})$ and use the FFT to get the associated spectrum $\tilde{x}^{T+i}_{n,r}$ and weights $w^{T+i}_{n,r} = L(\tilde{x} \mid \tilde{x}^{T+i}_{n,r}, \sigma)$; (3) for each $n \le N$, sample one value from $\{z^{T+i}_{n,r}\}$ using unnormalized weights $\{w^{T+i}_{n,r}\, p(z^{T+i}_{n,r} \mid z^{<T+i}_n)\}$. End for. Output: $N$ latent trajectories $\{z^{T \le t \le T+m}_n\}_{n \le N}$.
Open Source Code	Yes	The code is provided in a .zip file as part of the supplementary material, and will be made public through GitHub upon acceptance of the paper. The synthetic data experiment can be fully reproduced using the provided code.
Open Datasets	Yes	Processed data used in this work are available at a Zenodo repository (https://zenodo.org/records/14773929). Raw NorESM2 data can be downloaded from the ESGF CMIP6 data store. ... NorESM2 [Seland et al., 2020] and CESM2-FV2 [Danabasoglu et al., 2020] ... The spatially averaged vector autoregressive (SAVAR) model [Tibau et al., 2022]
Dataset Splits	Yes	The 800 years of data are split into 90% train and 10% test sets.
Hardware Specification	Yes	To train PICABU on 800 years of climate model data, we used the following compute resources: 2 RTX8000 GPUs, with 48GB of RAM each, for 10 hours. These resources scale linearly with the number of variables, number of latents, or number of input timesteps τ. The synthetic experiments are much faster (~0.5 hour on 1 RTX8000 GPU), as it is lower dimensional.
Software Dependencies	No	The paper does not explicitly state specific version numbers for software dependencies. It mentions optimizers like 'rmsprop' but not with version numbers for libraries like PyTorch or Python itself.
Experiment Setup	Yes	We provide the values of the hyperparameters that we use for the model training and for the Bayesian filtering in Table 7. These hyperparameters were determined with manual tuning, with the values of the spectral penalties being important for model performance. We performed a search over the 20 parameters described in Table 7, with 100 runs of PICABU. ... Table 7: Hyperparameter values used for training PICABU and reported results. Learning rate: 0.0003; Batch size: 128; Iterations: 200000; Optimizer: rmsprop; Number of latents: 90; τ (number of input timesteps): 5; CRPS coefficient: 1; Spatial spectrum coefficient: 3000; Temporal spectrum coefficient: 2000. Transition model: hidden layers = 2, neurons per layer = 8. Encoder-decoder model: hidden layers = 2, neurons per layer = 16. Sparsity constraint: initial µ = 1e-1, multiplication factor µ = 1.2, threshold = 1e-4, constrained value = 0.5. Orthogonality constraint: initial µ = 1e5, multiplication factor µ = 1.2, threshold = 1e-4. Bayesian filtering: N = 300, R = 10.
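
The rollout quoted in the Pseudocode row (Algorithm 1) can be illustrated with a small standard-library Python sketch. This is not the authors' code. The function names, the toy transition and decoder, and the Gaussian form of the likelihood L are all assumptions made for illustration, and a naive DFT stands in for the FFT. The sketch also simplifies the weighting: because candidates are drawn from the transition model, it resamples using the spectral likelihood alone and omits the extra transition-density factor that Algorithm 1 includes in its unnormalized weights.

```python
import cmath
import math
import random


def dft_magnitude(signal):
    """Magnitude spectrum via a naive DFT (stand-in for the FFT step)."""
    n = len(signal)
    return [
        abs(sum(v * cmath.exp(-2j * math.pi * k * t / n)
                for t, v in enumerate(signal)))
        for k in range(n)
    ]


def log_likelihood(target_spec, spec, sigma):
    """Assumed Gaussian log-likelihood of the target spectrum (up to a constant)."""
    return -0.5 * sum(((a - b) / sigma) ** 2 for a, b in zip(target_spec, spec))


def filtered_rollout(z_init, transition_sample, decode, target_spec,
                     sigma, R, m, rng):
    """Roll N latent trajectories forward m steps with spectral resampling.

    z_init: list of N initial latent states (lists of floats).
    transition_sample(history, rng): hypothetical sampler for the next latent.
    decode(z): hypothetical decoder mapping a latent to a spatial field.
    """
    trajectories = [[z] for z in z_init]
    for _ in range(m):
        for traj in trajectories:
            # Draw R candidate next states from the transition model.
            candidates = [transition_sample(traj, rng) for _ in range(R)]
            # Score each candidate by how well its spectrum matches the target.
            log_w = [
                log_likelihood(target_spec, dft_magnitude(decode(z)), sigma)
                for z in candidates
            ]
            # Subtract the max before exponentiating for numerical stability.
            top = max(log_w)
            weights = [math.exp(lw - top) for lw in log_w]
            # Keep one candidate, sampled in proportion to its weight.
            traj.append(rng.choices(candidates, weights=weights, k=1)[0])
    return trajectories


def toy_transition(history, rng):
    """Toy AR(1) dynamics with Gaussian noise (illustrative only)."""
    return [0.9 * v + rng.gauss(0.0, 0.1) for v in history[-1]]


def toy_decode(z):
    """Toy identity decoder (illustrative only)."""
    return list(z)
```

With the Table 7 settings one would pass N = 300 initial states and R = 10; a toy call such as `filtered_rollout(z0, toy_transition, toy_decode, target, sigma=1.0, R=10, m=5, rng=random.Random(0))` returns one list of m + 1 latent states per initial state.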