Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Monitoring Risks in Test-Time Adaptation

Authors: Mona Schirmer, Metod Jazbec, Christian Andersson Naesseth, Eric Nalisnick

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	In §5, we extensively study our monitoring tool and demonstrate that (i) it reliably detects risk violations and (ii) does not raise false alarms on a range of TTA methods, datasets and shift types. We empirically validate the effectiveness of our monitoring tool for a range of TTA methods under different distribution shifts.
Researcher Affiliation	Academia	Mona Schirmer (1), Metod Jazbec (1), Christian A. Naesseth (1), Eric Nalisnick (2). (1) UvA-Bosch Delta Lab, University of Amsterdam; (2) Johns Hopkins University
Pseudocode	Yes	Our approach to monitoring risks in TTA is summarized in Algo. 1. Our full threshold selection procedure is summarized in Algo. 2.
Open Source Code	Yes	Our code is available at: https://github.com/monasch/tta-monitor.
Open Datasets	Yes	We evaluate our monitoring approach on three datasets: synthetic corruptions from ImageNet-C [12], and real-world distribution shifts from Yearbook [8] and FMoW-Time [45].
Dataset Splits	Yes	Yearbook involves binary gender classification from portrait images, while FMoW-Time consists of satellite imagery with land use labels. Both datasets span multiple years; models are trained on data up to a cutoff year and tested on future samples. For Yearbook and FMoW, we follow the protocol of Yao et al. [45], using their provided model weights: a small CNN for Yearbook and DenseNet121 [14] for FMoW. For threshold selection (§3.4) we use N_cal = 1000 labeled samples from P_0.
Hardware Specification	Yes	All experiments are performed on NVIDIA RTX 6000 Ada with 48GB memory.
Software Dependencies	No	We use the confseq package [63] by [13] to compute the conjugate-mixture empirical Bernstein confidence lower bound on the target risk. For ImageNet-C, we use the pretrained ViT-Base model [7] from the Timm library [34]
Experiment Setup	Yes	If not specified otherwise, we use a tolerance threshold of ϵ_tol = 0.05 for 0-1 loss and ϵ_tol = 0.01 for Brier loss. We set α = α_source + α_test to 0.2 using most budget for controlling the test risk, i.e. α_test = 0.175 and α_source = 0.025. For threshold selection (§3.4) we use N_cal = 1000 labeled samples from P_0. For each TTA method, we use the default hyperparameters proposed in the respective paper. We use a test batch size of 32 for ImageNet and 64 for Yearbook and FMoW-Time.
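
The settings quoted in the Experiment Setup row can be gathered into a small configuration object, which may help when checking a reproduction against the reported values. The sketch below is illustrative only: the class name `MonitorConfig`, its field names, and the dataset keys are assumptions made for this example and are not taken from the authors' repository. Only the numeric values come from the row above.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class MonitorConfig:
    """Hypothetical container for the reported settings (names are assumptions)."""

    loss: str = "0-1"            # "0-1" or "brier"
    dataset: str = "ImageNet-C"  # "ImageNet-C", "Yearbook" or "FMoW-Time"
    alpha_source: float = 0.025  # error budget for the source risk
    alpha_test: float = 0.175    # error budget for the test risk
    n_cal: int = 1000            # labeled calibration samples from P_0

    @property
    def alpha(self) -> float:
        # Total error budget: alpha = alpha_source + alpha_test.
        return self.alpha_source + self.alpha_test

    @property
    def eps_tol(self) -> float:
        # Reported tolerance thresholds per loss function.
        return {"0-1": 0.05, "brier": 0.01}[self.loss]

    @property
    def batch_size(self) -> int:
        # Reported test batch sizes per dataset.
        return {"ImageNet-C": 32, "Yearbook": 64, "FMoW-Time": 64}[self.dataset]


if __name__ == "__main__":
    cfg = MonitorConfig(loss="brier", dataset="Yearbook")
    # Floating-point addition is inexact, so compare with a tolerance.
    assert math.isclose(cfg.alpha, 0.2)
    print(cfg.eps_tol, cfg.batch_size)
```

Deriving `alpha` from its two parts, rather than storing it separately, keeps the total budget consistent with the split if either value is changed.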