Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Gradient-Weight Alignment as a Train-Time Proxy for Generalization in Classification Tasks

Authors: Florian Hölzl, Daniel Rueckert, Georgios Kaissis

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Extensive experiments show that GWA accurately predicts optimal early stopping, enables principled model comparisons, and identifies influential training samples, providing a validation-set-free approach for model analysis directly from the training data.
Researcher Affiliation	Academia	Florian A. Hölzl, Daniel Rueckert, Georgios Kaissis; Institute for Artificial Intelligence in Medicine, Technical University of Munich; EMAIL
Pseudocode	Yes	Algorithm 1 Estimation of GWAT
Open Source Code	Yes	An open-source implementation of our approach in JAX and PyTorch can be found under https://github.com/hlzl/gwa.
Open Datasets	Yes	We leverage established public benchmarks to aid reproducibility, including ImageNet-1k (using the standard validation set for testing), ImageNet-V2 [41], and ImageNet-ReaL [42] as well as CIFAR-10 and its noisy variant CIFAR-10-N [43]
Dataset Splits	Yes	Validation sets are created via a standard train/val split of the original validation data (e.g., 90% training, 10% validation). If no test sets exist, the official validation sets are used as hold-out test sets and are referred as such in the following. All models are trained for a fixed number of optimization steps, i.e., with the same compute budget regardless of training set size. Beyond label noise evaluation, we assess the robustness of models selected using different early stopping criteria on CIFAR-C and ImageNet-C [45], employing realistic input perturbations consistent with our other experiments.
Hardware Specification	Yes	When training a ViT/S-16 implemented in JAX on ImageNet-1k with a single NVIDIA RTX A6000, GWA adds 2.5 sec to the per-epoch wall-clock time (on average 1861 images/s with GWA vs. 1867 images/s without GWA for 224² px).
Software Dependencies	No	The paper mentions "JAX and PyTorch" for implementation but does not specify their version numbers. Table 4 lists optimizers (Adam, SGD) but not specific software dependency versions.
Experiment Setup	Yes	Detailed hyperparameters used for all approaches are provided in Appendix C.
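
The throughput figures quoted in the Hardware Specification row can be sanity-checked with a little arithmetic. The sketch below is illustrative only and is not taken from the paper or its repository. It assumes one epoch covers the standard ImageNet-1k training set of 1,281,167 images, which the table does not state, and the function names are hypothetical.

```python
# Throughput figures quoted in the Hardware Specification row (images per second).
IMAGES_PER_SEC_WITH_GWA = 1861.0
IMAGES_PER_SEC_WITHOUT_GWA = 1867.0

# Assumption (not stated in the table): one epoch iterates over the standard
# ImageNet-1k training set of 1,281,167 images.
ASSUMED_IMAGES_PER_EPOCH = 1_281_167


def epoch_seconds(images_per_epoch: int, images_per_sec: float) -> float:
    """Wall-clock seconds for one epoch at a constant throughput."""
    return images_per_epoch / images_per_sec


def relative_slowdown(rate_with: float, rate_without: float) -> float:
    """Fraction of throughput lost when GWA is enabled."""
    return (rate_without - rate_with) / rate_without


# Extra seconds per epoch implied by the two throughput figures.
extra_seconds = (
    epoch_seconds(ASSUMED_IMAGES_PER_EPOCH, IMAGES_PER_SEC_WITH_GWA)
    - epoch_seconds(ASSUMED_IMAGES_PER_EPOCH, IMAGES_PER_SEC_WITHOUT_GWA)
)
slowdown = relative_slowdown(IMAGES_PER_SEC_WITH_GWA, IMAGES_PER_SEC_WITHOUT_GWA)

print(f"extra seconds per epoch: {extra_seconds:.2f}")
print(f"relative slowdown: {slowdown:.2%}")
```

Under that assumption the two throughput figures imply an overhead of a little over two seconds per epoch, the same order of magnitude as the quoted 2.5 sec, and a relative slowdown of well under one percent. The exact figure depends on how many images the authors actually process per epoch.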