Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Gradient-Weight Alignment as a Train-Time Proxy for Generalization in Classification Tasks
Authors: Florian Hölzl, Daniel Rueckert, Georgios Kaissis
NeurIPS 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments show that GWA accurately predicts optimal early stopping, enables principled model comparisons, and identifies influential training samples, providing a validation-set-free approach for model analysis directly from the training data. |
| Researcher Affiliation | Academia | Florian A. Hölzl, Daniel Rueckert, Georgios Kaissis Institute for Artificial Intelligence in Medicine Technical University of Munich EMAIL |
| Pseudocode | Yes | Algorithm 1 Estimation of GWAT |
| Open Source Code | Yes | An open-source implementation of our approach in JAX and PyTorch can be found under https://github.com/hlzl/gwa. |
| Open Datasets | Yes | We leverage established public benchmarks to aid reproducibility, including ImageNet-1k (using the standard validation set for testing), ImageNet-V2 [41], and ImageNet-ReaL [42] as well as CIFAR-10 and its noisy variant CIFAR-10-N [43] |
| Dataset Splits | Yes | Validation sets are created via a standard train/val split of the original validation data (e.g., 90% training, 10% validation). If no test sets exist, the official validation sets are used as hold-out test sets and are referred as such in the following. All models are trained for a fixed number of optimization steps, i.e., with the same compute budget regardless of training set size. Beyond label noise evaluation, we assess the robustness of models selected using different early stopping criteria on CIFAR-C and ImageNet-C [45], employing realistic input perturbations consistent with our other experiments. |
| Hardware Specification | Yes | When training a ViT/S-16 implemented in JAX on ImageNet-1k with a single NVIDIA RTX A6000, GWA adds 2.5 sec to the per-epoch wall-clock time (on average 1861 images/s with GWA vs. 1867 images/s without GWA for 224² px). |
| Software Dependencies | No | The paper mentions "JAX and PyTorch" for implementation but does not specify their version numbers. Table 4 lists optimizers (Adam, SGD) but not specific software dependency versions. |
| Experiment Setup | Yes | Detailed hyperparameters used for all approaches are provided in Appendix C. |
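
The throughput figures quoted in the Hardware Specification row can be cross-checked with a little arithmetic. The sketch below is illustrative only and is not taken from the paper or its repository: it assumes one epoch is a single pass over the standard ImageNet-1k training set of 1,281,167 images (a figure the table does not state), and the function names are hypothetical.

```python
# Rough consistency check of the reported GWA overhead.
# Assumption: one epoch = one pass over the standard ImageNet-1k training set
# (1,281,167 images). The table above does not state this number.
IMAGENET_1K_TRAIN_IMAGES = 1_281_167


def epoch_seconds(n_images: int, images_per_second: float) -> float:
    """Wall-clock seconds for one pass over n_images at a given throughput."""
    return n_images / images_per_second


def gwa_overhead(n_images: int, rate_with_gwa: float, rate_without_gwa: float):
    """Return (extra seconds per epoch, relative slowdown) implied by two throughputs."""
    t_with = epoch_seconds(n_images, rate_with_gwa)
    t_without = epoch_seconds(n_images, rate_without_gwa)
    extra = t_with - t_without
    return extra, extra / t_without


# Throughputs as quoted in the table: 1861 images/s with GWA, 1867 without.
extra_s, slowdown = gwa_overhead(IMAGENET_1K_TRAIN_IMAGES, 1861.0, 1867.0)
print(f"extra time per epoch: {extra_s:.2f} s")
print(f"relative slowdown:    {slowdown:.2%}")
```

Under these assumptions the estimate comes out at a little over two seconds per epoch, roughly a 0.3% slowdown, which is the same order as the reported 2.5 sec. The remaining gap is plausibly due to rounding in the averaged throughput figures.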