Learning ReLU networks to high uniform accuracy is intractable
Authors: Julius Berner, Philipp Grohs, Felix Voigtlaender
ICLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We prove that, under very general assumptions, the minimal number of training samples for this task scales exponentially both in the depth and the input dimension of the network architecture. Our results are further corroborated by numerical experiments presented in Section 3 below. To illustrate our main result in Theorem 2.2, we estimate the error in (2) by a tractable approximation in a student-teacher setting. For each combination, we train student networks with 3 different seeds, 3 different widths, and 3 different batch sizes. In summary, this yields 2 × 4 × 40 × 3 × 3 × 3 = 8640 experiments, each executed on a single GPU. A sketch enumerating this experiment grid is given below the table. |
| Researcher Affiliation | Academia | Julius Berner (1), Philipp Grohs (1, 2, 3), and Felix Voigtlaender (4); (1) Faculty of Mathematics, University of Vienna, Austria; (2) Research Network Data Science @ Uni Vienna, University of Vienna, Austria; (3) RICAM, Austrian Academy of Sciences, Austria; (4) Mathematical Institute for Machine Learning and Data Science (MIDS), Catholic University of Eichstätt-Ingolstadt, Germany |
| Pseudocode | No | The paper describes mathematical proofs and outlines the proof idea, but does not contain any pseudocode or algorithm blocks. |
| Open Source Code | Yes | We provide an extensible implementation in PyTorch (Paszke et al., 2019) featuring multi-node experiment execution and hyperparameter tuning using Ray Tune (Liaw et al., 2018), experiment tracking using Weights & Biases and TensorBoard, and flexible experiment configuration. Building upon our work, research teams with sufficient computational resources can provide further numerical evidence on an even larger scale. The code can be found at https://github.com/juliusberner/theory2practice. A minimal Ray Tune launch sketch is given below the table. |
| Open Datasets | No | We obtain teacher networks u ∈ H^∞_{(d, N_1, ..., N_{L−1}, 1), c} by sampling their coefficients Φ componentwise according to a uniform distribution on [−c, c]. For every algorithm (A, m) ∈ A and seed ω ∈ Ω we consider point sequences x^{(u)} uniformly distributed in [−0.5, 0.5]^d with m(ω) = m. The corresponding point samples are used to train the coefficients of a neural network (student) using the Adam optimizer (Kingma & Ba, 2015) with an exponentially decaying learning rate. The data is thus generated synthetically from a uniform distribution rather than drawn from a named, publicly available dataset with an access link or citation. A PyTorch sketch of this data-generation and training setup appears below the table. |
| Dataset Splits | No | The paper mentions 'independent evaluation samples uniformly distributed' and details experimental parameters like sample sizes, but does not specify distinct training, validation, and test dataset splits with percentages or counts. It also does not reference predefined standard splits. |
| Hardware Specification | Yes | Table 1 (general hyperparameters for the experiments in Figure 1 and Section 3): GPUs per training: 1 (NVIDIA GTX-1080, RTX-2080Ti, A40, or A100) |
| Software Dependencies | No | The paper mentions software like 'PyTorch (Paszke et al., 2019)', 'Adam optimizer (Kingma & Ba, 2015)', 'Ray Tune (Liaw et al., 2018)', 'Weights & Biases and Tensor Board'. However, it does not provide specific version numbers for these software components. |
| Experiment Setup | Yes | Table 1 lists the general hyperparameters for the experiments in Figure 1 and Section 3: experiment precision, GPUs per training, optimizer, initialization, activation function, learning-rate scheduler, initial/final learning rate, decay frequency, number of samples, distribution of samples, evaluation norm, and number of evaluations. Table 3 lists the hyperparameters specific to the experiments in Section 3: experiment samples, dimension, number of teachers, teacher architecture, activation, coefficient norm, distribution of coefficients, number of seeds, student architecture, batch size, and number of epochs. A hypothetical configuration dictionary mirroring these categories appears at the end of this section. |
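
The sketch below enumerates the experiment grid quoted in the Research Type row. Only the factor counts (2 × 4 × 40 × 3 × 3 × 3 = 8640) come from the quoted text; the interpretation of the first three factors and all index values are assumptions for illustration.

```python
# Illustrative only: enumerate the grid behind "2 x 4 x 40 x 3 x 3 x 3 = 8640".
# The meaning of the first three factors is assumed; the last three (seeds,
# widths, batch sizes) are stated in the quoted text.
from itertools import product

input_dims = range(2)      # 2 settings (assumed: input dimensions)
teacher_archs = range(4)   # 4 settings (assumed: teacher architectures)
teachers = range(40)       # 40 settings (assumed: sampled teacher networks)
seeds = range(3)           # 3 different seeds
widths = range(3)          # 3 different student widths
batch_sizes = range(3)     # 3 different batch sizes

grid = list(product(input_dims, teacher_archs, teachers, seeds, widths, batch_sizes))
assert len(grid) == 2 * 4 * 40 * 3 * 3 * 3 == 8640  # one single-GPU run per tuple
```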
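As a rough sketch of how part of such a grid could be dispatched with Ray Tune (mentioned in the Open Source Code row), the snippet below uses the older function-trainable API (`tune.run`, `tune.grid_search`, `tune.report`). It is an assumption for illustration, not the configuration used in the linked repository, and `train_student` is a hypothetical placeholder.

```python
# Hedged sketch: launching a sub-grid with Ray Tune's legacy function-trainable API.
from ray import tune

def train_student(config):
    # Would build the teacher/student networks for `config`, sample the data,
    # run the optimization, and report the estimated error.
    tune.report(error=0.0)  # placeholder metric

tune.run(
    train_student,
    config={
        "seed": tune.grid_search([0, 1, 2]),
        "width": tune.grid_search([0, 1, 2]),
        "batch_size": tune.grid_search([0, 1, 2]),
    },
    resources_per_trial={"gpu": 1},  # each experiment executed on a single GPU
)
```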
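The PyTorch sketch below mirrors the data-generation and training procedure quoted in the Open Datasets row: a teacher ReLU network with coefficients drawn uniformly from [−c, c], point samples drawn uniformly from [−0.5, 0.5]^d, and a student fitted with Adam under an exponentially decaying learning rate. All concrete sizes, learning rates, and the one-hidden-layer architecture are placeholders, not values from the paper.

```python
import torch
import torch.nn as nn

d, width, c, m = 4, 16, 1.0, 1024  # placeholders: input dim, hidden width, coefficient bound, samples

def make_relu_net(d, width):
    # minimal one-hidden-layer ReLU network; the paper's architectures vary in depth and width
    return nn.Sequential(nn.Linear(d, width), nn.ReLU(), nn.Linear(width, 1))

# teacher with coefficients sampled componentwise, uniformly on [-c, c]
teacher = make_relu_net(d, width)
with torch.no_grad():
    for p in teacher.parameters():
        p.uniform_(-c, c)

# point samples uniformly distributed in [-0.5, 0.5]^d, labeled by the teacher
x = torch.rand(m, d) - 0.5
with torch.no_grad():
    y = teacher(x)

# student trained with Adam and an exponentially decaying learning rate
student = make_relu_net(d, width)
opt = torch.optim.Adam(student.parameters(), lr=1e-3)            # placeholder learning rate
sched = torch.optim.lr_scheduler.ExponentialLR(opt, gamma=0.99)  # placeholder decay factor
for epoch in range(100):                                         # placeholder epoch count
    opt.zero_grad()
    loss = nn.functional.mse_loss(student(x), y)
    loss.backward()
    opt.step()
    sched.step()
```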
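Finally, a hypothetical configuration dictionary grouping the hyperparameter categories listed in Tables 1 and 3 (Experiment Setup row); every value marked as a placeholder is illustrative and not taken from the paper.

```python
# Hypothetical configuration mirroring the categories of Tables 1 and 3; values
# tagged "placeholder" are illustrative, not the paper's settings.
general_config = {  # Table 1: shared by the experiments in Figure 1 and Section 3
    "precision": "float32",                     # placeholder
    "gpus_per_training": 1,                     # NVIDIA GTX-1080, RTX-2080Ti, A40, or A100
    "optimizer": "Adam",
    "activation": "ReLU",
    "lr_scheduler": "exponential_decay",
    "initial_lr": 1e-3,                         # placeholder
    "final_lr": 1e-5,                           # placeholder
    "decay_frequency": 1000,                    # placeholder (steps)
    "sample_distribution": "uniform on [-0.5, 0.5]^d",
    "evaluation_norm": "sup norm",              # placeholder
}
section3_config = {  # Table 3: specific to the Section 3 experiments
    "teacher_coefficient_distribution": "uniform on [-c, c]",
    "num_seeds": 3,
    "num_student_widths": 3,
    "num_batch_sizes": 3,
    "num_epochs": 100,                          # placeholder
}
```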