Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Self-Boost via Optimal Retraining: An Analysis via Approximate Message Passing

Authors: Adel Javanmard, Rudrajit Das, Alessandro Epasto, Vahab Mirrokni

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We show that extensions of the optimal g_t derived in Theorem 3.2 is very effective for improving the performance of standard linear probing (i.e., fitting a linear layer on top of a pretrained model) as well as full network training with the cross-entropy loss for binary classification in the presence of label noise... We compare BayesMix RT/BayesMix-Simple RT with full RT and consensus-based RT proposed in [9]. For all our experiments, we use the body of a ResNet-50 model pretrained on ImageNet... In Tables 1 and 2, we list the average test accuracies of full RT, consensus-based RT, and BayesMix RT (28) after 1 and 10 iterations for MedMNIST Pneumonia corrupted by the uniform noise model with p = 0.45...
Researcher Affiliation	Collaboration	Adel Javanmard (University of Southern California, Google Research); Rudrajit Das (Google Research); Alessandro Epasto (Google Research); Vahab Mirrokni (Google Research)
Pseudocode	No	The paper describes iterative procedures and update rules using mathematical equations (4)-(5) and (17)-(18), but does not contain a clearly labeled pseudocode or algorithm block.
Open Source Code	No	Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [No] Justification: All our datasets are publicly available (links provided in the paper). We have provided experimental details in Appendix N to reproduce the results.
Open Datasets	Yes	We consider two datasets available on TensorFlow: (i) MedMNIST Pneumonia [42] which is a medical binary classification dataset, and (ii) Food-101 [6] which is a multi-class food-based classification dataset... MedMNIST Pneumonia (https://www.tensorflow.org/datasets/catalog/pneumonia_mnist)... Food-101 (https://www.tensorflow.org/datasets/catalog/food101)
Dataset Splits	Yes	1. MedMNIST Pneumonia (https://www.tensorflow.org/datasets/catalog/pneumonia_mnist): This has 4708 training examples and comes with a validation set of size 200. The test set consists of 624 examples. 2. Food-101 (https://www.tensorflow.org/datasets/catalog/food101): Each class in Food-101 has 750 training examples; so the total number of examples for two classes (pho vs. ramen and spaghetti bolognese vs. spaghetti carbonara) is 1500. Out of these 1500 examples, we randomly select 100 examples as our validation set. The test set consists of 500 examples in total.
Hardware Specification	Yes	Our experiments were done using TensorFlow and Keras on one 128 GB CPU and one 40 GB A100 GPU (per run).
Software Dependencies	No	Our experiments were done using TensorFlow and Keras on one 128 GB CPU and one 40 GB A100 GPU (per run). The paper mentions TensorFlow and Keras but does not specify their version numbers.
Experiment Setup	Yes	For initial training as well as for each iteration of retraining, the optimizer is Adam (with default values of β1 = 0.9 and β2 = 0.999) with batch size = 32 & number of epochs = 10 for linear probing and batch size = 128 & number of epochs = 2 for full network training. We also apply weight decay = 0.1 in the case of full network training to mitigate overfitting. We tune the learning rate by monitoring the accuracy on a small clean validation set... We tune η_adv, η_0 and η_1 from {5×10^-3, 10^-3, 5×10^-4, 10^-4, 5×10^-5, 10^-5, 5×10^-6, 10^-6}.
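
The setup quoted in the table can be written down as a small configuration sketch. This is an illustration only, not the authors' code: no code was released, and the names `LEARNING_RATES`, `ADAM`, `SETUPS`, `flip_labels`, and `select_learning_rate` are hypothetical. It also assumes that the "uniform noise model" flips each binary label independently with probability p, and that the learning rate is chosen as the grid value with the highest clean-validation accuracy; neither detail is spelled out on this page.

```python
import random

# Learning-rate grid quoted in the Experiment Setup row.
LEARNING_RATES = [5e-3, 1e-3, 5e-4, 1e-4, 5e-5, 1e-5, 5e-6, 1e-6]

# Adam defaults quoted in the Experiment Setup row.
ADAM = {"beta_1": 0.9, "beta_2": 0.999}

# Per-regime settings; weight decay for linear probing is not reported (None).
SETUPS = {
    "linear_probing": {"batch_size": 32, "epochs": 10, "weight_decay": None},
    "full_network": {"batch_size": 128, "epochs": 2, "weight_decay": 0.1},
}


def flip_labels(labels, p, seed=0):
    """Flip each 0/1 label independently with probability p (assumed noise model)."""
    rng = random.Random(seed)
    return [1 - y if rng.random() < p else y for y in labels]


def select_learning_rate(validation_accuracy, grid=LEARNING_RATES):
    """Return the grid value that maximizes a caller-supplied validation score."""
    return max(grid, key=validation_accuracy)
```

Here `validation_accuracy` stands for any function that trains with a given learning rate and returns accuracy on the small clean validation set; for the noise level quoted in the Research Type row one would call `flip_labels(train_labels, p=0.45)`.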