Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026.

Test-Time Model Adaptation with Only Forward Passes

Authors: Shuaicheng Niu, Chunyan Miao, Guohao Chen, Pengcheng Wu, Peilin Zhao

ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments on four benchmarks and full precision/quantized models verify our effectiveness. 4. Experiments
Researcher Affiliation | Collaboration | (1) College of Computing and Data Science, Nanyang Technological University, Singapore; (2) Joint NTU-WeBank Research Centre on Fintech, Singapore; (3) Joint NTU-UBC Research Centre of Excellence in Active Living for the Elderly (LILY), Singapore; (4) Tencent AI Lab, Shenzhen, China.
Pseudocode | Yes | Algorithm 1 Forward-Optimization Adaptation (FOA).
Open Source Code | Yes | Code: https://github.com/mr-eggplant/FOA.
Open Datasets | Yes | The models are trained on the source ImageNet-1K training set and the model weights are obtained from the timm repository (Wightman, 2019).
Dataset Splits | Yes | We use the validation set of ImageNet-1K to estimate source ID statistics.
Hardware Specification | Yes | The Wall-Clock Time (seconds) and Memory Usage (MB) are measured for processing 50,000 images of ImageNet-C on a single RTX 3090 GPU.
Software Dependencies | No | The paper mentions 'timm repository (Wightman, 2019)' and 'PTQ4ViT (Yuan et al., 2022)' as software used, and also 'pycma' (https://github.com/CMA-ES/pycma), but does not provide specific version numbers for Python, PyTorch, or the libraries themselves (e.g., timm vX.Y.Z).
Experiment Setup | Yes | We set the number of prompt embeddings Np to 3 and initialize prompts with uniform initialization. We set the batch size (BS) to 64 by following TENT and SAR for fair comparisons. The population size K is set to 28 = 4 + 3 log(prompt dim) by following (Hansen, 2016) and λ in Eqn. (5) is set to 0.4 × BS/64 on ImageNet-C/V2/Sketch, and 0.2 × BS/64 on ImageNet-R to balance the magnitude of two losses. We use the validation set of ImageNet-1K to estimate source ID statistics. The step size γ in Eqn. (7) is set to 1.0, aiming to exactly align the overall center of testing and training features.
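
The arithmetic quoted in the Experiment Setup row can be checked with a short sketch. It is not taken from the authors' repository: the class name `FOASetup`, its method names, the embedding width of 768 (a ViT-Base-sized assumption), and the use of the natural logarithm are all assumptions made for illustration. The source states only the resulting values.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class FOASetup:
    """Hypothetical container for the hyperparameters quoted in the report."""

    num_prompts: int = 3      # Np, from the report
    batch_size: int = 64      # BS, from the report
    step_size: float = 1.0    # gamma in Eqn. (7), from the report
    embed_dim: int = 768      # ASSUMPTION: not stated in the report

    @property
    def prompt_dim(self) -> int:
        # ASSUMPTION: "prompt dim" is the flattened size of all prompt embeddings.
        return self.num_prompts * self.embed_dim

    def raw_population_size(self) -> float:
        # 4 + 3 * log(prompt dim); natural log is an ASSUMPTION.
        return 4 + 3 * math.log(self.prompt_dim)

    def loss_weight(self, dataset: str) -> float:
        # lambda in Eqn. (5): 0.4 * BS/64 on ImageNet-C/V2/Sketch, 0.2 * BS/64 on ImageNet-R.
        base = {
            "ImageNet-C": 0.4,
            "ImageNet-V2": 0.4,
            "ImageNet-Sketch": 0.4,
            "ImageNet-R": 0.2,
        }
        if dataset not in base:
            raise ValueError(f"no lambda reported for dataset {dataset!r}")
        return base[dataset] * self.batch_size / 64


if __name__ == "__main__":
    setup = FOASetup()
    print("prompt dim:", setup.prompt_dim)
    print("raw population size:", round(setup.raw_population_size(), 3))
    print("lambda on ImageNet-C:", setup.loss_weight("ImageNet-C"))
    print("lambda on ImageNet-R:", setup.loss_weight("ImageNet-R"))
```

Under these assumptions the raw value is about 27.2, so rounding up would match the reported K = 28. The report does not say which rounding or embedding width was used, so this is a consistency check rather than a derivation.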