Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Training Two-Layer ReLU Networks with Gradient Descent is Inconsistent

Authors: David Holzmüller, Ingo Steinwart

JMLR 2022 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We further provide numerical evidence that this happens in practical situations, for some multidimensional distributions, and that stochastic gradient descent exhibits similar behavior. We also provide empirical results on how the choice of initialization and optimizer can influence this behavior. We support our theory with experimental evidence in Section 9, and provide further experiments on ways to resolve the problems discussed in this paper in Section 10."
Researcher Affiliation | Academia | "David Holzmüller EMAIL, University of Stuttgart, Faculty of Mathematics and Physics, Institute for Stochastics and Applications; Ingo Steinwart EMAIL, University of Stuttgart, Faculty of Mathematics and Physics, Institute for Stochastics and Applications"
Pseudocode | No | The paper describes mathematical definitions (e.g., 'Definition 1', 'Definition 4'), assumptions ('Assumption 2', 'Assumption 3', 'Assumption 5'), and proofs, but does not include any clearly labeled pseudocode or algorithm blocks. The training procedure is described in prose.
Open Source Code | Yes | "Data for the figures in this section can be reproduced using the code at github.com/dholzmueller/nn_inconsistency, which is archived at https://doi.org/10.18419/darus-2978."
Open Datasets | Yes | "For (1) and (2), we use the following experimental setup: We compute each estimated probability using 10^4 Monte Carlo trials. We choose P_data as the uniform distribution on the data set D from Example 1 and sample a data set D of size n = m^2 from P_data. For d ∈ {1, 2, 4, 8, 16, 32, 64}, we randomly draw samples (x, y) as x = u · x̃/‖x̃‖ with x̃ ~ N(0, I_d), u ~ U[0, 1], and y = cos(2π‖x‖) = cos(2πu) (Eq. 6). Here, x̃ and u are independent random variables. This makes the distribution of x rotationally invariant, with its radius uniformly distributed on [0, 1]. The definition of y implies that no label noise is used."
Dataset Splits | Yes | "We choose a training set with 256·d samples, use a batch size of 256, and train for 8192/d epochs such that we always use 8192 iterations. Every 64/d epochs, we compute errors on the validation and test sets, which contain 1024 samples each."
Hardware Specification | No | The paper mentions 'Tensorflow and Keras' and 'PyTorch' as software but provides no specific details about the hardware (GPU/CPU models, specific cloud instances, etc.) used for the experiments. The conclusion refers generally to running experiments on 'a High Performance Computing Resource' without specifics.
Software Dependencies | No | The paper mentions 'Tensorflow and Keras' (Section 2), 'PyTorch' (Section 10.3), and 'Keras (Chollet and others, 2015)' (Section 9.1, footnote 3), as well as the 'Adam' optimizer's hyperparameters. However, specific version numbers for these software packages are not provided.
Experiment Setup | Yes | "We choose a training set with 256·d samples, use a batch size of 256, and train for 8192/d epochs such that we always use 8192 iterations. Every 64/d epochs, we compute errors on the validation and test sets, which contain 1024 samples each. For the learning rate η on the grid where the 100 repetitions have the lowest RMSE on average, another grid search is performed on the finer grid {η · 2^{k/8} : k ∈ {−3, −2, …, 3}} to find the finally used best learning rate. We set the momentum hyperparameter to 0.9, which corresponds to the default momentum used for Adam. Adam: a very popular adaptive optimizer proposed by Kingma and Ba (2015). Except for the learning rate, we use the standard hyperparameters β1 = 0.9, β2 = 0.999, and ϵ = 10^{−8}."
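The sampling scheme quoted in the Open Datasets row (rotationally invariant x with radius uniform on [0, 1], noiseless labels y = cos(2π‖x‖)) can be sketched as follows. This is a minimal NumPy sketch, not the authors' released code; the function and variable names are our own:

```python
import numpy as np

def sample_dataset(n, d, seed=None):
    """Draw n pairs (x, y): x is rotationally invariant in R^d with its
    radius uniform on [0, 1], and y = cos(2*pi*||x||) with no label noise."""
    rng = np.random.default_rng(seed)
    x_tilde = rng.standard_normal((n, d))                 # x_tilde ~ N(0, I_d)
    direction = x_tilde / np.linalg.norm(x_tilde, axis=1, keepdims=True)
    u = rng.uniform(0.0, 1.0, size=n)                     # radius u ~ U[0, 1]
    x = u[:, None] * direction                            # x = u * x_tilde/||x_tilde||
    y = np.cos(2 * np.pi * u)                             # = cos(2*pi*||x||)
    return x, y

# e.g. a training set of 256*d samples for d = 8, as in the quoted setup
x_train, y_train = sample_dataset(n=256 * 8, d=8, seed=0)
```

Normalizing a standard Gaussian draw yields a uniform direction on the sphere, so multiplying by an independent uniform radius gives exactly the rotationally invariant distribution the quote describes.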
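The second-stage learning-rate search quoted in the Experiment Setup row refines the best coarse-grid value η over the grid {η · 2^{k/8} : k ∈ {−3, …, 3}}. A small sketch of that grid construction (a hypothetical helper, not from the paper's code):

```python
def fine_lr_grid(eta):
    """Finer search grid around the best coarse-grid learning rate eta:
    {eta * 2**(k/8) : k in -3..3}, i.e. 7 values spaced by a factor 2**(1/8)."""
    return [eta * 2.0 ** (k / 8.0) for k in range((-3), 4)]

grid = fine_lr_grid(0.01)
# grid[3] is eta itself; adjacent entries differ by a factor of 2**(1/8)
```

Each candidate would then be evaluated the same way as the coarse grid (average RMSE over the repetitions) to pick the finally used learning rate.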