Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Rethinking Entropy in Test-Time Adaptation: The Missing Piece from Energy Duality

Authors: Mincheol Park, Heeji Won, Won Woo Ro, Suhyun Kim

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We conduct experiments to validate the following: (1) the performance of ReTTA compared to existing entropy- and energy-based TTA methods under various distribution shifts, including challenging scenarios such as online label shifts; (2) the self-adjusting impact of λ1 within the newly introduced loss ℓ_SSM(θ), its projection distributions, and replacing alternative losses with SSM; and (3) the contribution of ℓ_TCC(θ) to performance, its role in reducing entropy, and the sensitivity to λ2.
Researcher Affiliation	Collaboration	Samsung Advanced Institute of Technology¹, Samsung Electronics¹, Korea University², Yonsei University³, Kyung Hee University⁴
Pseudocode	Yes	Table 5: Comparison of ReTTA with baseline TTA methods defined by Algorithms 1-3 in Appendix B.1 on three ImageNet-scale out-of-distribution benchmarks under mild adaptation settings.
Open Source Code	No	Answer: [No] Justification: Code will be released after the review process to avoid potential violations of the double-blind policy.
Open Datasets	Yes	We evaluate ReTTA on ImageNet-C [14], a widely-used benchmark for assessing model generalization under diverse distribution shifts. The dataset consists of 15 corruption types, divided into four main categories (Noise, Blur, Weather, and Digital), each with five severity levels, for a total of 1K classes. We further evaluate ReTTA under mild adaptation conditions on three additional ImageNet-scale out-of-distribution benchmarks. ImageNet-R contains rendered versions of ImageNet objects, introducing a large domain shift. In contrast, ImageNetV2 is a closely related re-sampling of the original ImageNet distribution, while ImageNet-S consists of single-channel sketch drawings.
Dataset Splits	Yes	We evaluate ReTTA on ImageNet-C [14], a widely-used benchmark for assessing model generalization under diverse distribution shifts. The dataset consists of 15 corruption types, divided into four main categories (Noise, Blur, Weather, and Digital), each with five severity levels, for a total of 1K classes. Table 1 compares the performance of ReTTA with state-of-the-art entropy-based methods (MEMO, Tent, EATA, SAR, DeYO) and energy-based methods (TEA, AEA) on ImageNet-C under mild corruption conditions (severity level 5). We evaluate ReTTA under severe online label shifts, following the setting of an infinite imbalance ratio (p_t^max(y) / p_t^min(y) = ∞) as in SAR.
Hardware Specification	No	Answer: [No] Justification: We do not report specific compute details, such as workers, memory, and time, as resource allocation varies depending on data center schedules and shared usage.
Software Dependencies	No	We perform experiments using two model architectures, ResNet-50 (with BN/GN) and ViT-Base (with LN), from torchvision and timm, respectively. Following SAR [27], we use SGD with momentum 0.9, a batch size of 64, and learning rates of 0.00025 (ResNet) and 0.001 (ViT).
Experiment Setup	Yes	Following SAR [27], we use SGD with momentum 0.9, a batch size of 64, and learning rates of 0.00025 (ResNet) and 0.001 (ViT). Unless otherwise stated, we also apply the data sampling and loss-reweighting scheme from DeYO [20]. For TTA, we update only the affine parameters θ_affine ⊂ θ of the normalization layers (batch/group norm in ResNet-50 and layer norm in ViT-Base), following Tent [34]. Unless otherwise stated, we fix the TCC loss coefficient at λ2 = 1. All experiments use one-shot TTA: each test sample is observed and updated once. Further hyperparameters and implementation details are provided in Appendix B.
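The Research Type row contrasts entropy-based and energy-based TTA methods. For a softmax classifier with logits z, the prediction entropy H(p) and the free energy E(x) = -log Σ_i exp(z_i) (temperature 1) are tied by a standard identity, H(p) = -E(x) - Σ_i p_i z_i, which follows from log p_i = z_i - logsumexp(z). The sketch below only checks that identity numerically for intuition; it is not ReTTA's ℓ_SSM or ℓ_TCC loss, whose definitions are not given in this listing, and all function names are illustrative.

```python
import math
from typing import List


def logsumexp(logits: List[float]) -> float:
    # Numerically stable log(sum(exp(z_i))), shifting by the max logit.
    m = max(logits)
    return m + math.log(sum(math.exp(z - m) for z in logits))


def softmax(logits: List[float]) -> List[float]:
    # p_i = exp(z_i - logsumexp(z))
    lse = logsumexp(logits)
    return [math.exp(z - lse) for z in logits]


def entropy(logits: List[float]) -> float:
    # Shannon entropy of the softmax prediction, H(p) = -sum_i p_i log p_i.
    p = softmax(logits)
    return -sum(pi * math.log(pi) for pi in p if pi > 0.0)


def free_energy(logits: List[float], temperature: float = 1.0) -> float:
    # Free energy of the logits, E(x) = -T * logsumexp(z / T).
    return -temperature * logsumexp([z / temperature for z in logits])


def entropy_via_energy(logits: List[float]) -> float:
    # Same entropy written through the energy (T = 1): H = -E(x) - sum_i p_i z_i.
    p = softmax(logits)
    return -free_energy(logits) - sum(pi * zi for pi, zi in zip(p, logits))


if __name__ == "__main__":
    z = [2.0, 1.0, 0.1]
    print(entropy(z), entropy_via_energy(z))  # the two values agree up to rounding
```

The identity shows that entropy mixes the energy term with the expected logit, so minimizing entropy and minimizing energy push the logits in related but not identical directions.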
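The Open Datasets and Dataset Splits rows describe ImageNet-C as 15 corruption types in four categories, each at five severity levels. A minimal sketch of that evaluation grid follows. The lowercase corruption identifiers follow the folder naming commonly used in ImageNet-C distributions and should be treated as an assumption; `evaluation_settings` is a hypothetical helper.

```python
from typing import Dict, Iterable, List, Tuple

# The 15 ImageNet-C corruption types grouped into the four categories
# (identifier spelling is an assumption based on common folder names).
IMAGENET_C: Dict[str, List[str]] = {
    "Noise": ["gaussian_noise", "shot_noise", "impulse_noise"],
    "Blur": ["defocus_blur", "glass_blur", "motion_blur", "zoom_blur"],
    "Weather": ["snow", "frost", "fog", "brightness"],
    "Digital": ["contrast", "elastic_transform", "pixelate", "jpeg_compression"],
}


def evaluation_settings(severities: Iterable[int] = range(1, 6)) -> List[Tuple[str, str, int]]:
    """Enumerate (category, corruption, severity) triples to evaluate."""
    levels = list(severities)
    return [
        (category, corruption, severity)
        for category, corruptions in IMAGENET_C.items()
        for corruption in corruptions
        for severity in levels
    ]


if __name__ == "__main__":
    # Table 1 uses severity level 5 only: one setting per corruption type.
    for setting in evaluation_settings([5]):
        print(setting)
```

Restricting to severity 5, as in Table 1, yields 15 settings; the full grid has 75.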
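The Dataset Splits row mentions online label shift with an infinite imbalance ratio, p_t^max(y) / p_t^min(y) = ∞, following SAR. One way to produce such a stream, assumed here rather than taken from the ReTTA code, is to order test samples by label so that each batch holds very few classes and most classes have zero probability at time t. The sketch builds a class-ordered stream and measures the empirical per-batch ratio; both function names are hypothetical.

```python
import math
from collections import Counter
from typing import List, Sequence


def class_sorted_stream(labels: Sequence[int], batch_size: int) -> List[List[int]]:
    """Order sample indices by label (stable sort) and cut them into consecutive batches."""
    order = sorted(range(len(labels)), key=lambda i: labels[i])
    return [order[i:i + batch_size] for i in range(0, len(order), batch_size)]


def batch_imbalance_ratio(batch: Sequence[int], labels: Sequence[int], num_classes: int) -> float:
    """Empirical max/min class frequency within one batch; infinite if any class is absent."""
    counts = Counter(labels[i] for i in batch)
    per_class = [counts.get(c, 0) for c in range(num_classes)]
    lo, hi = min(per_class), max(per_class)
    return math.inf if lo == 0 else hi / lo


if __name__ == "__main__":
    labels = [2, 0, 1, 0, 2, 1] * 10  # 60 toy samples, 20 per class
    for batch in class_sorted_stream(labels, batch_size=20):
        print(batch_imbalance_ratio(batch, labels, num_classes=3))  # inf for every batch
```

With 1K ImageNet classes and batches of 64, almost every class is missing from any given batch, so the ratio stays infinite throughout the stream.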
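The Experiment Setup row states that only the affine parameters of normalization layers are updated (as in Tent), with SGD at momentum 0.9, and that each test batch is seen once. The dependency-free sketch below mirrors that recipe on toy scalar parameters: it keeps only `weight`/`bias` entries owned by normalization modules and applies one momentum-SGD step per batch (the same update form as PyTorch's SGD without dampening or Nesterov). Parameter names such as `bn1.weight`, the gradient function, and all helper names are hypothetical; in a real PyTorch model one would instead filter modules by normalization type.

```python
from typing import Callable, Dict, Iterable, List, Sequence

Params = Dict[str, float]


def select_norm_affine(param_names: Iterable[str], norm_modules: Iterable[str]) -> List[str]:
    """Keep only weight/bias (affine) parameters that belong to normalization modules."""
    norm = set(norm_modules)
    keep = []
    for name in param_names:
        module, _, leaf = name.rpartition(".")
        if module in norm and leaf in ("weight", "bias"):
            keep.append(name)
    return keep


def sgd_momentum_step(params: Params, grads: Params, velocity: Params,
                      trainable: Iterable[str], lr: float, momentum: float = 0.9) -> None:
    """In-place update: v <- momentum * v + g, then p <- p - lr * v (trainable names only)."""
    for name in trainable:
        v = momentum * velocity.get(name, 0.0) + grads[name]
        velocity[name] = v
        params[name] -= lr * v


def one_shot_tta(params: Params, stream: Iterable[Sequence[int]],
                 grad_fn: Callable[[Params, Sequence[int]], Params],
                 trainable: List[str], lr: float, momentum: float = 0.9) -> Params:
    """Each batch is visited exactly once and triggers a single update step."""
    velocity: Params = {}
    for batch in stream:
        grads = grad_fn(params, batch)  # stand-in for the TTA loss gradient
        sgd_momentum_step(params, grads, velocity, trainable, lr, momentum)
    return params


# Learning rates quoted in the table (batch size 64, momentum 0.9 for both).
LEARNING_RATES = {"resnet50": 0.00025, "vit_base": 0.001}
```

For example, with `params = {"conv1.weight": 1.0, "bn1.weight": 1.0, "bn1.bias": 0.0}` and `trainable = select_norm_affine(params, ["bn1"])`, running `one_shot_tta` leaves `conv1.weight` untouched while `bn1.weight` and `bn1.bias` move. Freezing everything except the normalization affine parameters keeps the number of adapted values small, which is the usual motivation for this Tent-style choice.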