Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

TRUST: Test-Time Refinement using Uncertainty-Guided SSM Traverses

Authors: Sahar Dastani, Ali Bahri, Gustavo Vargas Hakim, Moslem Yazdanpanah, Mehrdad Noori, David OSOWIECHI, Samuel Barbeau, Ismail Ayed, Herve Lombaert, Christian Desrosiers

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Experiments on seven benchmarks show that TRUST consistently improves robustness and outperforms existing TTA methods. The code is available at: https://github.com/Sahardastani/trust. In this section, we present a comprehensive evaluation of our proposed method across seven benchmark datasets.
Researcher Affiliation	Academia	(1) LIVIA, ILLS, ÉTS Montréal, Canada; (2) Mila Quebec AI Institute; (3) Polytechnique Montreal
Pseudocode	Yes	A. Pseudo-code. In this section, we give the pseudo-code for our proposed test-time adaptation method, TRUST. This pseudo-code provides a concise summary of the key steps involved in our approach, offering a high-level abstraction of the implementation. Algorithm 1 outlines the overall TRUST procedure for test-time adaptation. ... Algorithm 2 defines the FORWARD_AND_ADAPT function.
Open Source Code	Yes	The code is available at: https://github.com/Sahardastani/trust.
Open Datasets	Yes	For corruption-based robustness, we use CIFAR10-C [47], CIFAR100-C [47], and ImageNet-C [47]... For domain generalization, we assess on PACS [48], ImageNet-S [49], ImageNet-V2 [50], and ImageNet-R [51].
Dataset Splits	Yes	For PACS, which includes four domains (photo, art painting, cartoon, and sketch), we follow the standard protocol: one domain is held out for evaluation while training on the remaining three. Specifically, we use the photo domain as the held-out test set. For datasets such as ImageNet-S, ImageNet-V2, and ImageNet-R, which share the same label space as ImageNet, no fine-tuning is required.
Hardware Specification	Yes	All experiments were conducted using a single NVIDIA A6000 GPU.
Software Dependencies	No	Optimization is performed using the Adam optimizer with a learning rate of 10^-4 and a batch size of 128, ensuring consistent dynamics and fair comparison across benchmarks.
Experiment Setup	Yes	Optimization is performed using the Adam optimizer with a learning rate of 10^-4 and a batch size of 128, ensuring consistent dynamics and fair comparison across benchmarks.
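
The dataset split and experiment setup quoted in the table can be written down as a small configuration sketch. This is an illustration only, not the authors' code: the names `PACS_DOMAINS`, `AdaptationConfig`, and `leave_one_domain_out` are hypothetical, the domain identifiers are assumed spellings, and the learning rate is assumed to be 10^-4 (1e-4).

```python
from dataclasses import dataclass

# The four PACS domains named in the table (identifier spelling is an assumption).
PACS_DOMAINS = ("photo", "art_painting", "cartoon", "sketch")


@dataclass(frozen=True)
class AdaptationConfig:
    """Hypothetical container for the reported optimization settings."""
    optimizer: str = "adam"
    learning_rate: float = 1e-4  # assumed reading of the reported learning rate
    batch_size: int = 128


def leave_one_domain_out(domains, held_out):
    """Return (training domains, held-out test domain) for one split."""
    if held_out not in domains:
        raise ValueError(f"unknown domain: {held_out!r}")
    train = [d for d in domains if d != held_out]
    return train, held_out


# The table states that the photo domain is the held-out test set.
train_domains, test_domain = leave_one_domain_out(PACS_DOMAINS, "photo")
config = AdaptationConfig()
```

Keeping the split and the hyperparameters in one place like this makes it easier to check a rerun against the values the table reports.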