Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Test-Time Adaptation by Causal Trimming

Authors: Yingnan Liu, Rui Qiao, Mong-Li Lee, Wynne Hsu

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We theoretically analyze the effectiveness of this approach and empirically validate TACT on real-world out-of-distribution benchmarks. TACT consistently outperforms state-of-the-art methods by a significant margin. Our code is available at https://github.com/NancyQuris/TACT. ... 6 Performance Study We study the test-time adaptation performance under real-world distribution shifts, using datasets from multiple modalities, including image, audio, and text. Compared to prior works that primarily benchmark on image data, our comprehensive experiments offer broader insights into the generalizability of TACT and other TTA methods. ... Datasets. We summarize the datasets used in our experiments below:
Researcher Affiliation	Academia	1 School of Computing, National University of Singapore; 2 Institute of Data Science, National University of Singapore; 3 Singapore-MIT Alliance for Research and Technology
Pseudocode	No	The paper describes methods and theoretical analyses but does not include any explicitly labeled pseudocode or algorithm blocks. The procedural steps are explained in paragraph form or mathematical equations.
Open Source Code	Yes	Our code is available at https://github.com/NancyQuris/TACT.
Open Datasets	Yes	Datasets. We summarize the datasets used in our experiments below: Birdcalls [15, 24, 34], curated by [9], is an audio classification dataset... Camelyon17 [2], sourced from the Wilds benchmark [28], is a medical imaging dataset... Civil Comments [3], from the Wilds benchmark [28], is a natural language dataset... ImageNet-R [14] contains 30,000 images... ImageNet-V2 [41] is collected years after the original ImageNet...
Dataset Splits	Yes	The test set includes 724 audio clips. ... The test set consists of 85,054 images. ... The test set contains 133,782 comments. ... The training starts from a weight pretrained on ImageNet, and the best model is selected by macro F1 on the in-distribution validation split. ... As instructed in [28], the training starts from a randomly initialized weight, and the best model is selected by the average classification accuracy on the validation domain.
Hardware Specification	Yes	We perform experiments on the NVIDIA V100 GPU with 32GB memory.
Software Dependencies	Yes	We implement TACT using PyTorch 2.1.2.
Experiment Setup	Yes	We use a test batch size of 64 [29, 36]. There are two hyperparameters in TACT, the number of augmentation n and the number of removed principal components m. We search n ∈ {2^1, 2^2, ..., 2^8}, m ∈ [1, 16] and m is an integer. For TACTadapt, we search λ ∈ {1, 5} × {0.1, 1, 10, 100}. The rest hyperparameters follow the search space of SHOT. For all baseline methods, we perform hyperparameter tuning within the search spaces specified in their respective papers. The detailed configurations and search procedures are provided in Appendix D.2.
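
The search space quoted in the Experiment Setup row can be enumerated as a small grid. The following is a minimal sketch, not the authors' tuning code. It assumes that the set for n reads as the powers of two from 2^1 to 2^8 (the superscripts appear to have been lost in extraction), and that each λ value is a product of one element of {1, 5} and one element of {0.1, 1, 10, 100}. The names `N_AUGMENTATIONS`, `M_REMOVED_COMPONENTS`, `LAMBDA_VALUES`, and `tact_grid` are hypothetical and do not come from the paper or its repository.

```python
from itertools import product

# Assumption: n ranges over powers of two, 2^1 .. 2^8 (2, 4, ..., 256).
N_AUGMENTATIONS = [2 ** k for k in range(1, 9)]

# m is an integer in the closed interval [1, 16], as stated in the quoted text.
M_REMOVED_COMPONENTS = list(range(1, 17))

# Assumption: lambda = a * b with a in {1, 5} and b in {0.1, 1, 10, 100}.
LAMBDA_VALUES = sorted({a * b for a, b in product([1, 5], [0.1, 1, 10, 100])})


def tact_grid():
    """Yield every (n, m) configuration in the assumed TACT search space."""
    for n, m in product(N_AUGMENTATIONS, M_REMOVED_COMPONENTS):
        yield {"n": n, "m": m}


if __name__ == "__main__":
    configs = list(tact_grid())
    print(len(configs), "TACT configurations")
    print(len(LAMBDA_VALUES), "candidate lambda values")
```

Under these assumptions the grid over n and m is small enough for exhaustive search. How the paper actually traverses it, and which hyperparameters are tuned jointly with λ, is described in its Appendix D.2 and is not reproduced here.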