Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Training Two-Layer ReLU Networks with Gradient Descent is Inconsistent

Authors: David Holzmüller, Ingo Steinwart

JMLR 2022 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We further provide numerical evidence that this happens in practical situations, for some multidimensional distributions, and that stochastic gradient descent exhibits similar behavior. We also provide empirical results on how the choice of initialization and optimizer can influence this behavior. We support our theory with experimental evidence in Section 9, and provide further experiments on ways to resolve the problems discussed in this paper in Section 10."
Researcher Affiliation | Academia | "David Holzmüller EMAIL, University of Stuttgart, Faculty of Mathematics and Physics, Institute for Stochastics and Applications; Ingo Steinwart EMAIL, University of Stuttgart, Faculty of Mathematics and Physics, Institute for Stochastics and Applications"
Pseudocode | No | The paper describes mathematical definitions (e.g., 'Definition 1', 'Definition 4'), assumptions ('Assumption 2', 'Assumption 3', 'Assumption 5'), and proofs, but does not include any clearly labeled pseudocode or algorithm blocks. The training procedure is described in prose.
Open Source Code | Yes | "Data for the figures in this section can be reproduced using the code at github.com/dholzmueller/nn_inconsistency, which is archived at https://doi.org/10.18419/darus-2978."
Open Datasets | Yes | "For (1) and (2), we use the following experimental setup: We compute each estimated probability using 10^4 Monte Carlo trials. We choose P_data as the uniform distribution on the data set D from Example 1 and sample a data set D of size n = m^2 from P_data. For d ∈ {1, 2, 4, 8, 16, 32, 64}, we randomly draw samples (x, y) as x = u · x̃/‖x̃‖ with x̃ ~ N(0, I_d), u ~ U[0, 1], and y = cos(2π‖x‖) = cos(2πu) (Eq. 6). Here, x̃ and u are independent random variables. This makes the distribution of x rotationally invariant, with its radius uniformly distributed on [0, 1]. The definition of y implies that no label noise is used."
Dataset Splits | Yes | "We choose a training set with 256·d samples, use a batch size of 256, and train for 8192/d epochs such that we always use 8192 iterations. Every 64/d epochs, we compute errors on the validation and test sets, which contain 1024 samples each."
Hardware Specification | No | The paper mentions 'Tensorflow and Keras' and 'PyTorch' as software but provides no specific details about the hardware (GPU/CPU models, specific cloud instances, etc.) used for the experiments. The conclusion refers generally to running experiments on 'a High Performance Computing Resource' without specifics.
Software Dependencies | No | The paper mentions 'Tensorflow and Keras' (Section 2), 'PyTorch' (Section 10.3), and 'Keras (Chollet and others, 2015)' (Section 9.1, footnote 3), as well as the 'Adam' optimizer's hyperparameters. However, specific version numbers for these software packages are not provided.
Experiment Setup | Yes | "We choose a training set with 256·d samples, use a batch size of 256, and train for 8192/d epochs such that we always use 8192 iterations. Every 64/d epochs, we compute errors on the validation and test sets, which contain 1024 samples each. For the learning rate η on the grid where the 100 repetitions have the lowest RMSE on average, another grid search is performed on the finer grid {η · 2^{k/8} : k ∈ {−3, −2, …, 3}} to find the finally used best learning rate. We set the momentum hyperparameter to 0.9, which corresponds to the default momentum used for Adam. Adam: a very popular adaptive optimizer proposed by Kingma and Ba (2015). Except for the learning rate, we use the standard hyperparameters β1 = 0.9, β2 = 0.999, and ϵ = 10^{−8}."
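The sampling scheme quoted in the Open Datasets row (rotationally invariant x with radius uniform on [0, 1], noiseless labels y = cos(2π‖x‖)) can be sketched as follows. This is a minimal NumPy sketch, not the authors' released code; the function and variable names are our own:

```python
import numpy as np

def sample_dataset(n, d, seed=None):
    """Draw n pairs (x, y): x is rotationally invariant in R^d with its
    radius uniform on [0, 1], and y = cos(2*pi*||x||) with no label noise."""
    rng = np.random.default_rng(seed)
    x_tilde = rng.standard_normal((n, d))                 # x_tilde ~ N(0, I_d)
    direction = x_tilde / np.linalg.norm(x_tilde, axis=1, keepdims=True)
    u = rng.uniform(0.0, 1.0, size=n)                     # radius u ~ U[0, 1]
    x = u[:, None] * direction                            # x = u * x_tilde/||x_tilde||
    y = np.cos(2 * np.pi * u)                             # = cos(2*pi*||x||)
    return x, y

# e.g. a training set of 256*d samples for d = 8, as in the quoted setup
x_train, y_train = sample_dataset(n=256 * 8, d=8, seed=0)
```

Normalizing a standard Gaussian draw yields a uniform direction on the sphere, so multiplying by an independent uniform radius gives exactly the rotationally invariant distribution the quote describes.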
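The second-stage learning-rate search quoted in the Experiment Setup row refines the best coarse-grid value η over the grid {η · 2^{k/8} : k ∈ {−3, …, 3}}. A small sketch of that grid construction (a hypothetical helper, not from the paper's code):

```python
def fine_lr_grid(eta):
    """Finer search grid around the best coarse-grid learning rate eta:
    {eta * 2**(k/8) : k in -3..3}, i.e. 7 values spaced by a factor 2**(1/8)."""
    return [eta * 2.0 ** (k / 8.0) for k in range((-3), 4)]

grid = fine_lr_grid(0.01)
# grid[3] is eta itself; adjacent entries differ by a factor of 2**(1/8)
```

Each candidate would then be evaluated the same way as the coarse grid (average RMSE over the repetitions) to pick the finally used learning rate.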