Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Leveraging Gradients for Unsupervised Accuracy Estimation under Distribution Shift
Authors: Renchunzi Xie, Ambroise Odonnat, Vasilii Feofanov, Ievgen Redko, Jianfeng Zhang, Bo An
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments conducted with various architectures on diverse distribution shifts demonstrate that our method significantly outperforms current state-of-the-art approaches. The code is available at https://github.com/Renchunzi-Xie/GdScore. ... Section 5 Experiments |
| Researcher Affiliation | Collaboration | Renchunzi Xie, College of Computing and Data Science, Nanyang Technological University; Ambroise Odonnat, Huawei Noah's Ark Lab, Inria Paris, France |
| Pseudocode | Yes | Appendix B, Pseudo-code of GdScore: Our proposed GdScore for unsupervised accuracy estimation can be calculated as shown in Algorithm 1. Algorithm 1: Unsupervised Accuracy Estimation via GdScore |
| Open Source Code | Yes | The code is available at https://github.com/Renchunzi-Xie/GdScore. |
| Open Datasets | Yes | For pre-training the neural network, we use CIFAR-10, CIFAR-100 (Krizhevsky & Hinton, 2009), TinyImageNet (Le & Yang, 2015), ImageNet (Deng et al., 2009), Office-31 (Saenko et al., 2010), Office-Home (Venkateswara et al., 2017), Camelyon17-WILDS (Koh et al., 2021), and BREEDS (Santurkar et al., 2020) ... we use CIFAR-10C, CIFAR-100C, and ImageNet-C (Hendrycks & Dietterich, 2019) ... TinyImageNet-C (Hendrycks & Dietterich, 2019) |
| Dataset Splits | No | The paper mentions using specific datasets such as CIFAR-10C, CIFAR-100C, and ImageNet-C, which span 19 types of corruption across 5 severity levels, and TinyImageNet-C, with 15 types of corruption and 5 severity levels. It also refers to |
| Hardware Specification | No | The paper states: "To show the versatility of our approach across different architectures, we perform all our experiments on ResNet18, ResNet50 (He et al., 2016) and WRN-50-2 (Zagoruyko & Komodakis, 2016) models." However, no specific hardware (e.g., GPU, CPU models, or memory) used for these experiments is mentioned. |
| Software Dependencies | No | The paper mentions "SGD with a learning rate of 10⁻³, cosine learning rate decay (Loshchilov & Hutter, 2016), a momentum of 0.9, and a batch size of 128." These are training hyperparameters, not software dependencies. No specific software libraries or frameworks with version numbers (e.g., PyTorch 1.9, TensorFlow 2.x) are provided, which would be necessary for reproducibility. |
| Experiment Setup | Yes | Training details. To show the versatility of our approach across different architectures, we perform all our experiments on ResNet18, ResNet50 (He et al., 2016) and WRN-50-2 (Zagoruyko & Komodakis, 2016) models. We train them for 20 epochs for CIFAR-10 (Krizhevsky & Hinton, 2009) and 50 epochs for the other datasets. In all cases, we use SGD with a learning rate of 10⁻³, cosine learning rate decay (Loshchilov & Hutter, 2016), a momentum of 0.9, and a batch size of 128. For all experiments, we used p = 0.3 to compute GdScore. |
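
For readers who want to reproduce the training schedule quoted in the Experiment Setup row, the following is a minimal sketch of the hyperparameters and a cosine learning rate decay, using only the Python standard library. It is not taken from the authors' repository. It assumes the quoted learning rate reads as 10⁻³, that the decay is applied once per epoch, and that it anneals to zero; the names `BASE_LR`, `EPOCHS`, and `cosine_lr` are hypothetical and chosen for illustration.

```python
import math

# Hyperparameters quoted in the Experiment Setup row.
BASE_LR = 1e-3       # assumption: the quoted learning rate is 10^-3
MOMENTUM = 0.9
BATCH_SIZE = 128
GDSCORE_P = 0.3      # the "p = 0.3" quoted for computing GdScore
EPOCHS = {"CIFAR-10": 20, "other": 50}


def cosine_lr(epoch: int, total_epochs: int,
              base_lr: float = BASE_LR, min_lr: float = 0.0) -> float:
    """Learning rate at the start of a 0-indexed epoch under cosine decay.

    Assumptions (not stated in the quoted text): the rate is updated
    once per epoch and anneals from base_lr down to min_lr.
    """
    if total_epochs <= 0 or not 0 <= epoch <= total_epochs:
        raise ValueError("epoch must lie in [0, total_epochs]")
    cosine = 0.5 * (1.0 + math.cos(math.pi * epoch / total_epochs))
    return min_lr + (base_lr - min_lr) * cosine


# Per-epoch learning rates for the 20-epoch CIFAR-10 setting.
schedule = [cosine_lr(e, EPOCHS["CIFAR-10"]) for e in range(EPOCHS["CIFAR-10"])]
```

In a deep learning framework, the same schedule would normally come from the framework's built-in cosine annealing scheduler attached to an SGD optimizer; which framework and version the authors used is not reported, as the Software Dependencies row notes.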