Depth-Width Trade-offs for ReLU Networks via Sharkovsky's Theorem

Authors: Vaggos Chatziafratis, Sai Ganesh Nagarajan, Ioannis Panageas, Xiao Wang

ICLR 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this section, we provide experimental evidence for our depth separation results by training a neural network of constant width, but with increasing depth, on a classification task that closely resembles the n-alternating points problem that appeared in Telgarsky (2015) and is the foundation of our separation results as well. Our goal is to create a diagram showing how the classification error drops as a function of the depth of the network for a fixed value of the width.
Researcher Affiliation | Academia | Vaggos Chatziafratis, Department of Computer Science, Stanford University; Sai Ganesh Nagarajan, Ioannis Panageas, and Xiao Wang, Singapore University of Technology and Design
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | No | The paper does not contain any explicit statement about releasing source code for the methodology described, nor does it provide any links to a code repository.
Open Datasets | No | The paper describes the creation of a custom dataset: 'We create 8000 equally spaced points from [0,1] (in increasing order), where the first 1000 points are of label 0, the second 1000 are label 1 and this label alternates every 1000 points.' However, no concrete access information (link, DOI, repository, or formal citation for public access) is provided for this dataset. (A minimal reconstruction of this dataset is sketched after this table.)
Dataset Splits | No | The paper mentions creating '8000 equally spaced points' and discusses 'training error', but it does not specify explicit training, validation, or test dataset splits (e.g., percentages, sample counts, or references to predefined splits).
Hardware Specification | No | The paper describes its experimental procedure, including varying network depth and using the ADAM optimizer, but it does not provide specific hardware details (e.g., exact GPU/CPU models, processor types, or memory amounts) used for running its experiments.
Software Dependencies | No | The paper mentions using 'ReLUs', 'sigmoid' activation, and the 'ADAM optimizer Kingma & Ba (2014)', but it does not provide specific version numbers for any software dependencies (e.g., programming language, libraries, or frameworks) used in the experiments.
Experiment Setup | Yes | To perform the experiments, we vary the depth of the neural network (excluding the input and the output layer) as d = 1, 2, 3, 4, 5. In addition, we fix the number of neurons in each layer to 6. All activations are ReLUs, while the last layer is the classifier, which uses a sigmoid to output probabilities. Each model adds one extra hidden layer, and we use the same hyper-parameters to train all networks. Moreover, we require the training error (the classification error) to tend to 0 during training, i.e., we try to overfit the data (as we aim to demonstrate a representation result rather than a statistical/generalization result). Thus, for the actual training we use the same parameters for all models with the ADAM optimizer (Kingma & Ba, 2014) and set the number of epochs to 200 to enable overfitting. (A hedged training sketch follows the table.)
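
The dataset referenced in the 'Open Datasets' row is fully specified by the quoted sentence, so it can be reconstructed from the text alone. Below is a minimal sketch in NumPy; the use of NumPy and all variable names are assumptions, since the paper does not describe its tooling.

```python
import numpy as np

# Sketch of the dataset described in the paper (tooling assumed, not stated in the paper):
# 8000 equally spaced points in [0, 1], in increasing order, with the label alternating
# every 1000 points (first 1000 points -> label 0, next 1000 -> label 1, and so on).
n_points = 8000
block = 1000

x = np.linspace(0.0, 1.0, n_points).reshape(-1, 1)   # inputs, shape (8000, 1)
y = ((np.arange(n_points) // block) % 2).astype(np.float32).reshape(-1, 1)  # alternating 0/1 labels
```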
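The 'Experiment Setup' row pins down the architecture (hidden width 6, ReLU activations, sigmoid output, depths d = 1, ..., 5) and the training regime (shared hyper-parameters, ADAM, 200 epochs), but not the framework, loss function, learning rate, or batching. The sketch below assumes PyTorch, binary cross-entropy loss, full-batch updates, and ADAM's default learning rate; none of these choices are confirmed by the paper.

```python
import torch
import torch.nn as nn

def make_model(depth, width=6):
    """Constant-width ReLU network with `depth` hidden layers and a sigmoid output."""
    layers = [nn.Linear(1, width), nn.ReLU()]
    for _ in range(depth - 1):
        layers += [nn.Linear(width, width), nn.ReLU()]
    layers += [nn.Linear(width, 1), nn.Sigmoid()]
    return nn.Sequential(*layers)

def train_and_measure(x, y, depth, epochs=200):
    """Train with ADAM for a fixed number of epochs; return the final training (classification) error."""
    model = make_model(depth)
    optimizer = torch.optim.Adam(model.parameters())  # default learning rate; not reported in the paper
    loss_fn = nn.BCELoss()                            # assumed loss for the sigmoid output
    for _ in range(epochs):
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        optimizer.step()
    with torch.no_grad():
        preds = (model(x) >= 0.5).float()
        return (preds != y).float().mean().item()

# x, y come from the dataset sketch above
x_t = torch.tensor(x, dtype=torch.float32)
y_t = torch.tensor(y, dtype=torch.float32)
errors = {d: train_and_measure(x_t, y_t, d) for d in range(1, 6)}  # depth -> training error
```

Plotting `errors` against depth would give the kind of error-versus-depth diagram the authors describe, though with these assumed hyper-parameters the exact curve may differ from the paper's.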