Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Training Two-Layer ReLU Networks with Gradient Descent is Inconsistent
Authors: David Holzmüller, Ingo Steinwart
JMLR 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We further provide numerical evidence that this happens in practical situations, for some multidimensional distributions, and that stochastic gradient descent exhibits similar behavior. We also provide empirical results on how the choice of initialization and optimizer can influence this behavior. We support our theory with experimental evidence in Section 9 and provide further experiments on ways to resolve the problems discussed in this paper in Section 10. |
| Researcher Affiliation | Academia | David Holzmüller, University of Stuttgart, Faculty of Mathematics and Physics, Institute for Stochastics and Applications; Ingo Steinwart, University of Stuttgart, Faculty of Mathematics and Physics, Institute for Stochastics and Applications |
| Pseudocode | No | The paper describes mathematical definitions (e.g., 'Definition 1', 'Definition 4'), assumptions ('Assumption 2', 'Assumption 3', 'Assumption 5'), and proofs, but does not include any clearly labeled pseudocode or algorithm blocks. The training procedure is described in prose. |
| Open Source Code | Yes | Data for the figures in this section can be reproduced using the code at https://github.com/dholzmueller/nn_inconsistency, which is archived at https://doi.org/10.18419/darus-2978. |
| Open Datasets | Yes | For (1) and (2), we use the following experimental setup: We compute each estimated probability using 10^4 Monte Carlo trials. We choose P_data as the uniform distribution on the data set D from Example 1 and sample a data set D of size n = m^2 from P_data. For d ∈ {1, 2, 4, 8, 16, 32, 64}, we randomly draw samples (x, y) as x = u·x̃/‖x̃‖₂ with x̃ ~ N(0, I_d), u ~ U[0, 1], and y = cos(2π‖x‖₂) = cos(2πu) (Eq. 6). Here, x̃ and u are independent random variables. This makes the distribution of x rotationally invariant, with its radius being uniformly distributed on [0, 1]. The definition of y implies that no label noise is used. |
| Dataset Splits | Yes | We choose a training set with 256d samples, use a batch size of 256, and train for 8192/d epochs such that we always use 8192 iterations. Every 64/d epochs, we compute errors on the validation and test sets, which contain 1024 samples each. |
| Hardware Specification | No | The paper mentions 'Tensorflow and Keras' and 'PyTorch' as software but provides no specific details about the hardware (GPU/CPU models, specific cloud instances, etc.) used for the experiments. It refers only to running experiments on 'a High Performance Computing Resource' in the conclusion, without specifics. |
| Software Dependencies | No | The paper mentions 'Tensorflow and Keras' (Section 2), 'PyTorch' (Section 10.3), and 'Keras (Chollet and others, 2015)' (Section 9.1, footnote 3), as well as 'Adam' optimizer's hyperparameters. However, specific version numbers for these software packages are not provided. |
| Experiment Setup | Yes | We choose a training set with 256d samples, use a batch size of 256, and train for 8192/d epochs such that we always use 8192 iterations. Every 64/d epochs, we compute errors on the validation and test sets, which contain 1024 samples each. For the learning rate η on the grid where the 100 repetitions have the lowest RMSE on average, another grid search is performed on the finer grid {η·2^(k/8), k ∈ {−3, −2, …, 3}} to find the finally used best learning rate. We set the momentum hyperparameter to 0.9, which corresponds to the default momentum used for Adam. Adam: A very popular adaptive optimizer proposed by Kingma and Ba (2015). Except for the learning rate, we use the standard hyperparameters β₁ = 0.9, β₂ = 0.999, and ε = 10^(−8). |
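The sampling scheme quoted in the Open Datasets row (Eq. 6 of the paper) can be sketched in a few lines. This is an illustrative reconstruction, not the authors' code (their implementation lives in the linked repository); the function name `sample_dataset` and the use of NumPy are assumptions:

```python
import numpy as np

def sample_dataset(n, d, seed=None):
    """Draw n samples (x, y) as in Eq. (6): x is rotationally invariant
    with radius uniform on [0, 1]; y = cos(2*pi*||x||_2) with no label noise."""
    rng = np.random.default_rng(seed)
    x_tilde = rng.standard_normal((n, d))           # x~ ~ N(0, I_d), gives the direction
    u = rng.uniform(0.0, 1.0, size=(n, 1))          # u ~ U[0, 1], gives the radius
    x = u * x_tilde / np.linalg.norm(x_tilde, axis=1, keepdims=True)
    y = np.cos(2 * np.pi * np.linalg.norm(x, axis=1))  # equals cos(2*pi*u)
    return x, y
```

Normalizing a standard Gaussian vector yields a uniformly random direction, so scaling it by an independent `u` makes the radius of `x` uniform on [0, 1], matching the quoted description.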
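The training budget and learning-rate refinement quoted in the Experiment Setup row can likewise be sketched. The helper names `epochs_for_dim` and `fine_lr_grid` are hypothetical; the arithmetic follows the quoted setup (256d training samples, batch size 256, 8192 total iterations, finer grid {η·2^(k/8)}):

```python
def epochs_for_dim(d, samples_per_dim=256, batch_size=256, total_iters=8192):
    """Epochs so that every dimension d uses the same iteration budget:
    (256*d / 256) iterations per epoch, so 8192/d epochs give 8192 iterations."""
    n_train = samples_per_dim * d
    iters_per_epoch = n_train // batch_size   # = d with these defaults
    return total_iters // iters_per_epoch     # = 8192 / d

def fine_lr_grid(eta):
    """Refine the coarse-grid winner eta on {eta * 2**(k/8) : k = -3, ..., 3}."""
    return [eta * 2.0 ** (k / 8) for k in range(-3, 4)]
```

For example, `epochs_for_dim(8)` gives 1024 epochs, and `fine_lr_grid` returns seven candidate rates centered on the coarse-grid winner.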