Learning ReLU networks to high uniform accuracy is intractable

Authors: Julius Berner, Philipp Grohs, Felix Voigtlaender

ICLR 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Published as a conference paper at ICLR 2023. We prove that, under very general assumptions, the minimal number of training samples for this task scales exponentially both in the depth and the input dimension of the network architecture. Our results are further corroborated by numerical experiments presented in Section 3 below. To illustrate our main result in Theorem 2.2, we estimate the error in (2) by a tractable approximation in a student-teacher setting. For each combination, we train student networks with 3 different seeds, 3 different widths, and 3 different batch sizes. In summary, this yields 2 × 4 × 40 × 3 × 3 × 3 = 8640 experiments, each executed on a single GPU.
Researcher Affiliation | Academia | Julius Berner (1), Philipp Grohs (1,2,3), and Felix Voigtlaender (4); (1) Faculty of Mathematics, University of Vienna, Austria; (2) Research Network Data Science @ Uni Vienna, University of Vienna, Austria; (3) RICAM, Austrian Academy of Sciences, Austria; (4) Mathematical Institute for Machine Learning and Data Science (MIDS), Catholic University of Eichstätt-Ingolstadt, Germany
Pseudocode | No | The paper describes mathematical proofs and outlines the proof idea, but does not contain any pseudocode or algorithm blocks.
Open Source Code | Yes | We provide an extensible implementation in PyTorch (Paszke et al., 2019) featuring multi-node experiment execution and hyperparameter tuning using Ray Tune (Liaw et al., 2018), experiment tracking using Weights & Biases and TensorBoard, and flexible experiment configuration. Building upon our work, research teams with sufficient computational resources can provide further numerical evidence on an even larger scale. The code can be found at https://github.com/juliusberner/theory2practice. (A hedged sketch of enumerating the resulting experiment grid appears after this table.)
Open Datasets | No | We obtain teacher networks u ∈ H^∞_{(d, N_1, ..., N_{L−1}, 1), c} by sampling their coefficients Φ componentwise according to a uniform distribution on [−c, c]. For every algorithm (A, m) ∈ A and seed ω ∈ Ω we consider point sequences x^(u) uniformly distributed in [−0.5, 0.5]^d with m(ω) = m. The corresponding point samples are used to train the coefficients of a neural network (student) using the Adam optimizer (Kingma & Ba, 2015) with exponentially decaying learning rate. The paper uses a synthetic uniform distribution to generate samples for training, which is not a named, publicly available dataset with a specific access link or citation as typically defined. (A hedged sketch of this teacher-and-sample generation appears after this table.)
Dataset Splits | No | The paper mentions 'independent evaluation samples uniformly distributed' and details experimental parameters like sample sizes, but does not specify distinct training, validation, and test dataset splits with percentages or counts. It also does not reference predefined standard splits.
Hardware Specification | Yes | Table 1: General hyperparameters for the experiments in Figure 1 and Section 3. GPUs per training: 1 (NVIDIA GTX-1080, RTX-2080Ti, A40, or A100)
Software Dependencies | No | The paper mentions software like 'PyTorch (Paszke et al., 2019)', the 'Adam optimizer (Kingma & Ba, 2015)', 'Ray Tune (Liaw et al., 2018)', and 'Weights & Biases and TensorBoard'. However, it does not provide specific version numbers for these software components.
Experiment Setup | Yes | Table 1: General hyperparameters for the experiments in Figure 1 and Section 3. It lists experiment precision, GPUs, optimizer, initialization, activation function, learning-rate scheduler, initial/final learning rate, decay frequency, number of samples, distribution of samples, evaluation norm, and number of evaluations. Table 3: Hyperparameters specific to the experiments in Section 3. It lists samples, dimension, number of teachers, teacher architecture, activation, coefficient norm, distribution of coefficients, number of seeds, student architecture, batch size, and number of epochs. (A hedged training-and-evaluation sketch using placeholder values for these hyperparameters appears after this table.)
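
To make the synthetic data generation quoted in the Open Datasets row concrete, here is a minimal PyTorch sketch; it is not the authors' implementation from the linked repository. It builds a ReLU teacher network with coefficients drawn uniformly from [−c, c] and samples training points uniformly from [−0.5, 0.5]^d. The architecture, the bound c, the sample count m, and the helper names are placeholder assumptions.

```python
import torch
import torch.nn as nn

def sample_teacher(layer_sizes, c, seed=0):
    """Fully connected ReLU network with weights and biases drawn
    i.i.d. from the uniform distribution on [-c, c]."""
    torch.manual_seed(seed)
    layers = []
    for i in range(len(layer_sizes) - 1):
        linear = nn.Linear(layer_sizes[i], layer_sizes[i + 1])
        nn.init.uniform_(linear.weight, -c, c)
        nn.init.uniform_(linear.bias, -c, c)
        layers.append(linear)
        if i < len(layer_sizes) - 2:  # ReLU after every hidden layer
            layers.append(nn.ReLU())
    return nn.Sequential(*layers)

def sample_points(m, d, seed=0):
    """m training points drawn uniformly from the cube [-0.5, 0.5]^d."""
    gen = torch.Generator().manual_seed(seed)
    return torch.rand(m, d, generator=gen) - 0.5

# Placeholder architecture (d, N_1, ..., N_{L-1}, 1), bound c, and sample count m.
teacher = sample_teacher(layer_sizes=[4, 32, 32, 1], c=1.0)
x_train = sample_points(m=1024, d=4)
with torch.no_grad():
    y_train = teacher(x_train)  # labels are exact teacher outputs
```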
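The Experiment Setup row mentions Adam with an exponentially decaying learning rate and an evaluation norm computed on independent, uniformly distributed samples. The following sketch, with placeholder hyperparameters rather than the values from Tables 1 and 3, shows one way to train a student network and to replace the uniform (sup-norm) error from (2) by a tractable surrogate: the maximum deviation over many uniformly drawn evaluation points.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset

def train_student(x_train, y_train, width=64, epochs=100,
                  batch_size=128, lr=1e-3, gamma=0.995):
    """Fit a ReLU student with Adam and an exponentially decaying learning rate."""
    d = x_train.shape[1]
    student = nn.Sequential(nn.Linear(d, width), nn.ReLU(),
                            nn.Linear(width, width), nn.ReLU(),
                            nn.Linear(width, 1))
    opt = torch.optim.Adam(student.parameters(), lr=lr)
    sched = torch.optim.lr_scheduler.ExponentialLR(opt, gamma=gamma)
    loader = DataLoader(TensorDataset(x_train, y_train),
                        batch_size=batch_size, shuffle=True)
    for _ in range(epochs):
        for xb, yb in loader:
            opt.zero_grad()
            F.mse_loss(student(xb), yb).backward()
            opt.step()
        sched.step()  # decay the learning rate once per epoch
    return student

def uniform_error(student, teacher, d, n_eval=100_000):
    """Surrogate for the sup-norm error: maximum absolute deviation over
    many points drawn uniformly from [-0.5, 0.5]^d."""
    x_eval = torch.rand(n_eval, d) - 0.5
    with torch.no_grad():
        return (student(x_eval) - teacher(x_eval)).abs().max().item()
```

Because the maximum is taken over finitely many evaluation points, this surrogate can only underestimate the true uniform error, which is consistent with the paper's description of it as a tractable approximation of the error in (2).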
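Finally, the Research Type row reports 2 × 4 × 40 × 3 × 3 × 3 = 8640 single-GPU runs, dispatched via Ray Tune according to the Open Source Code row. The sketch below only reproduces the bookkeeping with itertools; the labels for the first two factors and all grid values are assumptions, not taken from the paper or repository.

```python
from itertools import product

# Hypothetical grid reproducing the count 2 * 4 * 40 * 3 * 3 * 3 = 8640.
# The meaning of the first two factors is not spelled out in this report,
# so "settings" and "dimensions" are placeholder labels.
settings    = range(2)            # e.g. two experiment variants
dimensions  = [2, 4, 8, 16]       # e.g. four input dimensions d (placeholder values)
teachers    = range(40)           # 40 sampled teacher networks
seeds       = range(3)
widths      = [128, 256, 512]     # placeholder student widths
batch_sizes = [32, 64, 128]       # placeholder batch sizes

configs = [
    {"setting": s, "dim": d, "teacher": t, "seed": r, "width": w, "batch_size": b}
    for s, d, t, r, w, b in product(settings, dimensions, teachers,
                                    seeds, widths, batch_sizes)
]
assert len(configs) == 8640  # one single-GPU training run per configuration
```

In the authors' setup, each such configuration corresponds to one training run on a single GPU, with Ray Tune handling multi-node scheduling; the sketch itself does not launch any training.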