Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

STACI: Spatio-Temporal Aleatoric Conformal Inference

Authors: Brandon Feng, David Park, Xihaier Luo, Arantxa Urdangarin, Shinjae Yoo, Brian Reich

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We evaluate STACI's performance on two ST datasets: one synthetic and one real. The first synthetic dataset is simulated mean sea surface height (MSS) data of the Arctic sea [37] based on historical satellite data from 3 different tracks. The data spans 10 days from March 1st to March 10th 2020 and has 1,158,505 total datapoints. The second real dataset is Aerosol Optical Depth (AOD) data captured using the Moderate Resolution Imaging Spectroradiometer (MODIS) on NASA's Terra satellite [53]. The data is spread on a 1400 × 720 grid spanning the Earth's surface and we use daily data from March 2025, equating to 3,189,641 total observations. We measure the performance of the algorithms in terms of both estimation and UQ quality on the test set of both datasets. For estimation quality, we use root mean square error (RMSE), negative Gaussian log likelihood (NLL). For UQ quality, we use the continuous ranked probability score (CRPS) metric and provide coverage of prediction intervals, interval score and interval width based on α = 0.05. Finally, we track time per training epoch to compare computational efficiency. The time for DRF is the time for one optimization iteration. Note that GPSat time is the total time to fit across all expert locations. We provide both Bayesian and conformal UQ performance for STACI. The conformal time represents time needed for the entire conformal step. For RMSE, NLL, CRPS, interval score and interval width, a lower value is better. For coverage, the value closest to 0.95 is deemed the best as estimators providing both over-coverage and under-coverage are considered to be inefficient. Table 1 shows results for the MSS dataset. Table 2 shows results for AOD dataset. Figure 2: Predicted AOD surface values. Top row: predicted surface. Red indicates higher AOD values. Bottom row: interval widths for Bayesian and conformal (STACI-C) uncertainty on AOD data. Darker shades denote narrower intervals.
Here, we perform two ablation studies for the AOD dataset to quantify model robustness. The first ablation study is impact of latent model dimension size on estimation error. The second ablation is impact of sampling percentage in training set construction on estimation error. Figure 3: Ablation Studies for STACI
Researcher Affiliation	Collaboration	Brandon R. Feng North Carolina State University David Keetae Park Brookhaven National Laboratory Xihaier Luo Brookhaven National Laboratory Arantxa Urdangarin University of the Basque Country Shinjae Yoo Brookhaven National Laboratory Brian J. Reich North Carolina State University
Pseudocode	No	The paper does not contain any clearly labeled pseudocode or algorithm blocks. Figure 1 is a diagram illustrating the algorithm pipeline, but it is not pseudocode.
Open Source Code	Yes	Code and data: https://github.com/bf5124/STACI
Open Datasets	Yes	The first synthetic dataset is simulated mean sea surface height (MSS) data of the Arctic sea [37] based on historical satellite data from 3 different tracks. The second real dataset is Aerosol Optical Depth (AOD) data captured using the Moderate Resolution Imaging Spectroradiometer (MODIS) on NASA's Terra satellite [53].
Dataset Splits	Yes	The MSS dataset is randomly split into an 80% train, 10% validation, 10% test split. As most observations are seen on each day, this setting tests ST interpolation with small artifacts such as cloud cover creating patches of missing data. For the AOD dataset, we sample 10% of observations randomly per day to comprise the training set. The validation set is all observations over the first 6 days while the test set is all observations on the 20th day.
Hardware Specification	Yes	All models are trained on NVIDIA A-100 GPUs for 15 epochs (optimization iterations for DRF) with batch size 1,024. Conformal prediction interval calculation is parallelized over 4 NVIDIA A-100 GPUs.
Software Dependencies	No	The paper mentions methods like SVGD (Stein variational gradient descent) but does not provide specific software names with version numbers for implementation details (e.g., Python, PyTorch, TensorFlow versions).
Experiment Setup	Yes	Each INR backbone has 5 layers with a layer width of 1,024. We set J = 5,000 for the final hidden layer width of STACI, representing the number of random Fourier features. We then use M = 10 network copies to train using SVGD for the initial Bayesian UQ. Deep GP is set with 4 total layers with layer width 9 and trained with 10 models. GPSat is initialized with 1,225 expert locations across the spatio-temporal domain. DRF is set with 5 hidden layers of width 1,024 and bottleneck layers of width 128 and trained with 10 models. All models are trained on NVIDIA A-100 GPUs for 15 epochs (optimization iterations for DRF) with batch size 1,024.
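
Illustrative sketch (not taken from the paper or its repository): the Research Type row above reports coverage, interval width, and interval score for prediction intervals at α = 0.05. The snippet below shows one common way such metrics could be computed. The function name interval_metrics and the toy data are hypothetical, and the interval score uses the standard form, width plus (2/α) times the distance by which the observation falls outside the interval; the paper's exact definition is not quoted on this page, so treat that choice as an assumption.

```python
from statistics import fmean


def interval_metrics(y_true, lower, upper, alpha=0.05):
    """Return (coverage, mean_width, mean_interval_score).

    Hypothetical helper for central (1 - alpha) prediction intervals.
    The interval score follows the standard form
        (u - l) + (2 / alpha) * (l - y) if y < l
        (u - l) + (2 / alpha) * (y - u) if y > u
    which is an assumption about the metric the paper reports.
    """
    if not y_true or not (len(y_true) == len(lower) == len(upper)):
        raise ValueError("inputs must be non-empty and of equal length")

    covered, widths, scores = [], [], []
    for y, lo, hi in zip(y_true, lower, upper):
        width = hi - lo
        # Penalty is zero when lo <= y <= hi, otherwise scaled miss distance.
        penalty = (2.0 / alpha) * (max(lo - y, 0.0) + max(y - hi, 0.0))
        covered.append(1.0 if lo <= y <= hi else 0.0)
        widths.append(width)
        scores.append(width + penalty)

    return fmean(covered), fmean(widths), fmean(scores)


# Toy data (made up for illustration): the third observation lies
# 0.5 below its interval, so it is not covered and is penalized.
y = [0.0, 1.0, 2.0, 3.0]
lo = [-1.0, 0.0, 2.5, 2.0]
hi = [1.0, 2.0, 3.5, 4.0]

coverage, width, score = interval_metrics(y, lo, hi)
print(f"coverage={coverage:.2f} width={width:.2f} interval_score={score:.2f}")
```

Under this definition a lower interval score is better, and coverage is judged by its closeness to 1 - α = 0.95, which matches the reading of the metrics given in the table.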