Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Retraining with Predicted Hard Labels Provably Increases Model Accuracy

Authors: Rudrajit Das, Inderjit S. Dhillon, Alessandro Epasto, Adel Javanmard, Jieming Mao, Vahab Mirrokni, Sujay Sanghavi, Peilin Zhong

ICML 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We empirically show that retraining selectively on the samples for which the predicted label matches the given label significantly improves label DP training at no extra privacy cost; we call this consensus-based retraining. For example, when training ResNet-18 on CIFAR100 with ϵ = 3 label DP, we obtain more than 6% improvement in accuracy with consensus-based retraining. ... Our main algorithmic contribution is empirically demonstrating the efficacy of consensus-based retraining in improving label DP training (Section 5).
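The consensus-based retraining quoted above selects, for the second training stage, only the samples where the first-stage model's prediction agrees with the given (label-DP) label. A minimal sketch of that selection step, reconstructed from the description here rather than taken from the authors' code:

```python
import numpy as np

def consensus_mask(given_labels, predicted_labels):
    """Boolean mask selecting samples where the first-stage model's
    predicted label agrees with the (possibly noisy, label-DP) given label."""
    return np.asarray(given_labels) == np.asarray(predicted_labels)

# Toy example: retrain only on the consensus samples.
given = np.array([2, 0, 1, 3, 2])  # labels output by a label-DP mechanism
pred = np.array([2, 1, 1, 3, 0])   # first-stage model predictions
mask = consensus_mask(given, pred)
retrain_labels = given[mask]       # -> array([2, 1, 3])
```

Because the selection only compares the model's own predictions against labels already released by the DP mechanism, it consumes no additional privacy budget, which is the point of the quoted claim.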
Researcher Affiliation | Collaboration | 1: Google Research; 2: Google; 3: University of Texas at Austin; 4: University of Southern California.
Pseudocode | No | The paper describes the retraining processes ('Full retraining' and 'Consensus-based retraining') in descriptive text within Section 1 and further discusses their empirical efficacy in Section 5. However, it does not present these procedures in a structured pseudocode block or algorithm box format.
Open Source Code | No | The paper does not contain an explicit statement or a direct link to a source-code repository where the code for the methodology described in this paper is made available. It mentions using a 'Small BERT model' and a 'BERT English uncased preprocessor' with links, but these are third-party models, not the authors' own implementation code.
Open Datasets | Yes | Here we empirically evaluate full and consensus-based RT on four classification datasets (available on TensorFlow) trained with label DP. These include three vision datasets, namely CIFAR-10, CIFAR-100, and DomainNet (Peng et al., 2019), and one language dataset, namely AG News Subset (Zhang et al., 2015). ... DomainNet (https://www.tensorflow.org/datasets/catalog/domainnet). ... AG News Subset (https://www.tensorflow.org/datasets/catalog/ag_news_subset).
Dataset Splits | Yes | Our training set consists of 45k examples and we assume access to a validation set with clean labels consisting of 5k examples which we use for deciding when to stop training, setting hyper-parameters, etc. For CIFAR-10 and CIFAR-100... We reserve 10% of the entire training set for validation and use the rest for training with label DP. Just like the CIFAR experiments, we assume that the validation set comes with clean labels.
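The split described above (45k train / 5k clean-label validation for CIFAR, i.e., 10% held out) could be produced along these lines; this is a hypothetical sketch assuming a simple shuffled index split, not the authors' pipeline:

```python
import numpy as np

def train_val_split(n_examples, val_fraction=0.1, seed=0):
    """Reserve val_fraction of the examples (assumed to keep clean labels)
    for validation; the remainder is trained with label DP."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_examples)
    n_val = int(round(n_examples * val_fraction))
    return idx[n_val:], idx[:n_val]  # (train indices, validation indices)

# CIFAR-10/100 have 50k training examples -> 45k train, 5k validation.
train_idx, val_idx = train_val_split(50_000, val_fraction=0.1)
```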
Hardware Specification | No | The paper does not explicitly describe the specific hardware (e.g., GPU models, CPU types, memory amounts) used for running its experiments. It mentions training various models (ResNet-18, ResNet-34, BERT model) and using TensorFlow and JAX, but no hardware specifications are provided.
Software Dependencies | No | The paper states: 'Our experiments were done using TensorFlow and JAX.' It also provides links to specific BERT model versions within TensorFlow Hub for the AG News Subset experiments: 'Small BERT model link: https://www.kaggle.com/models/tensorflow/bert/frameworks/tensorflow2/variations/bert-en-uncased-l-4-h-512-a-8/versions/2?tfhub-redirect=true, BERT English uncased preprocessor link: https://www.kaggle.com/models/tensorflow/bert/frameworks/tensorflow2/variations/en-uncased-preprocess/versions/3?tfhub-redirect=true.' While specific model versions are referenced, the core frameworks TensorFlow and JAX are not provided with their version numbers, nor are other potential software dependencies like Python or CUDA.
Experiment Setup | Yes | CIFAR-10. Optimizer is SGD with momentum = 0.9, batch size = 32, number of epochs in each stage of training (i.e., both stages of baseline, full RT and consensus-based RT) = 30. We use the cosine one-cycle learning rate schedule with initial learning rate = 0.1 for each stage of training. ... CIFAR-100. Details are the same as CIFAR-10 except that here the number of epochs in each stage of training = 40 and initial learning rate = 0.005. ... AG News Subset. Optimizer is Adam with fixed learning rate = 1e-5, batch size = 32, number of epochs in each training stage = 5.
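The quoted setup names a "cosine one-cycle" learning-rate schedule with a stated initial rate per stage. One plausible reading is a plain cosine decay from the initial rate toward zero over each training stage; the exact warmup/annealing details are an assumption here, since the paper gives only the schedule name and initial learning rates:

```python
import math

def cosine_lr(step, total_steps, lr_init):
    """Cosine decay from lr_init toward zero over one cycle.
    Assumed interpretation of the paper's 'cosine one-cycle' schedule."""
    t = min(step, total_steps) / total_steps
    return lr_init * 0.5 * (1.0 + math.cos(math.pi * t))

# CIFAR-10 stage: initial learning rate 0.1 over 30 epochs.
lrs = [cosine_lr(epoch, 30, 0.1) for epoch in range(31)]
# Starts at 0.1 and decays monotonically toward 0 by the end of the stage.
```

Under this reading, each of the three stages (baseline, full RT, consensus-based RT) restarts the cycle from its stated initial rate.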