Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Leveraging Gradients for Unsupervised Accuracy Estimation under Distribution Shift

Authors: Renchunzi Xie, Ambroise Odonnat, Vasilii Feofanov, Ievgen Redko, Jianfeng Zhang, Bo An

TMLR 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Extensive experiments conducted with various architectures on diverse distribution shifts demonstrate that our method significantly outperforms current state-of-the-art approaches. The code is available at https://github.com/Renchunzi-Xie/GdScore. ... Section 5 Experiments
Researcher Affiliation	Collaboration	Renchunzi Xie EMAIL College of Computing and Data Science Nanyang Technological University; Ambroise Odonnat EMAIL Huawei Noah's Ark Lab, Inria Paris, France
Pseudocode	Yes	B Pseudo-code of GdScore. Our proposed GdScore for unsupervised accuracy estimation can be calculated as shown in Algorithm 1. Algorithm 1: Unsupervised Accuracy Estimation via GdScore
Open Source Code	Yes	The code is available at https://github.com/Renchunzi-Xie/GdScore.
Open Datasets	Yes	For pre-training the neural network, we use CIFAR-10, CIFAR-100 (Krizhevsky & Hinton, 2009), Tiny ImageNet (Le & Yang, 2015), ImageNet (Deng et al., 2009), Office-31 (Saenko et al., 2010), Office-Home (Venkateswara et al., 2017), Camelyon17-WILDS (Koh et al., 2021), and BREEDS (Santurkar et al., 2020) ... we use CIFAR-10C, CIFAR-100C, and ImageNet-C (Hendrycks & Dietterich, 2019) ... Tiny ImageNet-C (Hendrycks & Dietterich, 2019)
Dataset Splits	No	The paper mentions using specific datasets such as CIFAR-10C, CIFAR-100C, and ImageNet-C, which span 19 types of corruption across 5 severity levels, and Tiny ImageNet-C, with 15 types of corruption and 5 severity levels. It also refers to
Hardware Specification	No	The paper states: "To show the versatility of our approach across different architectures, we perform all our experiments on ResNet18, ResNet50 (He et al., 2016) and WRN-50-2 (Zagoruyko & Komodakis, 2016) models." However, no specific hardware (e.g., GPU, CPU models, or memory) used for these experiments is mentioned.
Software Dependencies	No	The paper mentions "SGD with a learning rate of 10^-3, cosine learning rate decay (Loshchilov & Hutter, 2016), a momentum of 0.9, and a batch size of 128." These are algorithmic parameters. No specific software libraries or frameworks with version numbers (e.g., PyTorch 1.9, TensorFlow 2.x) are provided, which would be necessary for reproducibility.
Experiment Setup	Yes	Training details. To show the versatility of our approach across different architectures, we perform all our experiments on ResNet18, ResNet50 (He et al., 2016) and WRN-50-2 (Zagoruyko & Komodakis, 2016) models. We train them for 20 epochs for CIFAR-10 (Krizhevsky & Hinton, 2009) and 50 epochs for the other datasets. In all cases, we use SGD with a learning rate of 10^-3, cosine learning rate decay (Loshchilov & Hutter, 2016), a momentum of 0.9, and a batch size of 128. For all experiments, we used p = 0.3 to compute GdScore.
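
The optimizer settings quoted in the Experiment Setup row can be sketched in plain Python. This is a minimal illustration, not the authors' code: the quoted text names no framework, so the `TrainConfig` class and the `cosine_lr` helper are hypothetical, and the single decay cycle, the per-epoch step, and the decay toward zero are assumptions about how the cosine schedule is applied.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Hyperparameters quoted in the Experiment Setup row (names are illustrative)."""
    base_lr: float = 1e-3   # "learning rate of 10^-3"
    momentum: float = 0.9   # SGD momentum
    batch_size: int = 128
    epochs: int = 50        # 20 for CIFAR-10, 50 for the other datasets


def cosine_lr(epoch: int, cfg: TrainConfig, min_lr: float = 0.0) -> float:
    """Cosine-annealed learning rate at the start of `epoch` (0-indexed).

    Assumption: one decay cycle over all epochs, no warm restarts,
    updated once per epoch, decaying toward `min_lr`.
    """
    progress = epoch / cfg.epochs
    return min_lr + 0.5 * (cfg.base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# CIFAR-10 uses the shorter 20-epoch schedule.
cifar10_cfg = TrainConfig(epochs=20)
schedule = [cosine_lr(e, cifar10_cfg) for e in range(cifar10_cfg.epochs)]
```

Under these assumptions the rate starts at the base value, reaches half of it at the midpoint of training, and decreases monotonically. A per-iteration step or a nonzero floor would give a different curve, and the quoted text does not say which variant was used.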