Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. [K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026].
TRUST: Test-Time Refinement using Uncertainty-Guided SSM Traverses
Authors: Sahar Dastani, Ali Bahri, Gustavo Vargas Hakim, Moslem Yazdanpanah, Mehrdad Noori, David OSOWIECHI, Samuel Barbeau, Ismail Ayed, Herve Lombaert, Christian Desrosiers
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on seven benchmarks show that TRUST consistently improves robustness and outperforms existing TTA methods. The code is available at: https://github.com/Sahardastani/trust. In this section, we present a comprehensive evaluation of our proposed method across seven benchmark datasets. |
| Researcher Affiliation | Academia | (1) LIVIA, ILLS, ÉTS Montréal, Canada; (2) Mila Quebec AI Institute; (3) Polytechnique Montreal |
| Pseudocode | Yes | A. Pseudo-code: In this section, we give the pseudo-code for our proposed test-time adaptation method, TRUST. This pseudo-code provides a concise summary of the key steps involved in our approach, offering a high-level abstraction of the implementation. Algorithm 1 outlines the overall TRUST procedure for test-time adaptation. ... Algorithm 2 defines the FORWARD_AND_ADAPT function. |
| Open Source Code | Yes | The code is available at: https://github.com/Sahardastani/trust. |
| Open Datasets | Yes | For corruption-based robustness, we use CIFAR10-C [47], CIFAR100-C [47], and ImageNet-C [47]... For domain generalization, we assess on PACS [48], ImageNet-S [49], ImageNet-V2 [50], and ImageNet-R [51]. |
| Dataset Splits | Yes | For PACS, which includes four domains (photo, art painting, cartoon, and sketch), we follow the standard protocol: one domain is held out for evaluation while training on the remaining three. Specifically, we use the photo domain as the held-out test set. For datasets such as ImageNet-S, ImageNet-V2, and ImageNet-R, which share the same label space as ImageNet, no fine-tuning is required. |
| Hardware Specification | Yes | All experiments were conducted using a single NVIDIA A6000 GPU. |
| Software Dependencies | No | Optimization is performed using the Adam optimizer with a learning rate of $10^{-4}$ and a batch size of 128, ensuring consistent dynamics and fair comparison across benchmarks. |
| Experiment Setup | Yes | Optimization is performed using the Adam optimizer with a learning rate of $10^{-4}$ and a batch size of 128, ensuring consistent dynamics and fair comparison across benchmarks. |
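
The setup and split details quoted in the table can be captured in a small configuration sketch. This is a minimal illustration only, not code from the TRUST repository: the names `AdaptationConfig`, `leave_one_domain_out`, and the domain identifiers are hypothetical, and the learning rate is read as 10^-4 on the assumption that the exponent's minus sign was lost in extraction.

```python
from dataclasses import dataclass

# PACS domains as listed in the Dataset Splits row (identifier spelling is an assumption).
PACS_DOMAINS = ("photo", "art_painting", "cartoon", "sketch")


@dataclass(frozen=True)
class AdaptationConfig:
    """Hyperparameters quoted in the Experiment Setup row (hypothetical container)."""
    optimizer: str = "adam"
    learning_rate: float = 1e-4  # assumed reading of the garbled "10 4"
    batch_size: int = 128


def leave_one_domain_out(held_out, domains=PACS_DOMAINS):
    """Return (training domains, test domain) for one held-out domain."""
    if held_out not in domains:
        raise ValueError(f"unknown domain: {held_out!r}")
    train = [d for d in domains if d != held_out]
    return train, held_out


config = AdaptationConfig()
# The table states that the photo domain is the held-out test set.
train_domains, test_domain = leave_one_domain_out("photo")
```

Keeping the hyperparameters in a frozen dataclass makes it harder for one benchmark run to silently drift from the shared settings the paper describes.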