Effect of Activation Functions on the Training of Overparametrized Neural Nets
Authors: Abhishek Panigrahi, Abhishek Shetty, Navin Goyal
ICLR 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We study the effect of the choice of activation function (we often just say activation) on the training of overparametrized neural networks. By overparametrized setting we roughly mean that the number of parameters or weights in the networks is much larger than the number of data samples. ... 7 EXPERIMENTS: Synthetic data. We consider n equally spaced data points on S^1, randomly lifted to S^9. We randomly label the data-points from U{-1, 1}. We train a 2-layer neural network in the DZPS setting with mean squared loss, containing 10^6 neurons in the first layer with activations tanh, ReLU, swish and ELU at learning rate 10^-3. The output layer is not trained during gradient descent. In Figure 1(a) and Figure 1(b) we plot the squared loss against the number of epochs trained. Results are averaged over 5 different runs. ... Real data. We consider a random subset of 10^4 images from the CIFAR10 dataset (Krizhevsky & Hinton, 2009). We train a 2-layer network containing 10^5 neurons in the first layer. |
| Researcher Affiliation | Collaboration | Abhishek Panigrahi Microsoft Research India t-abpani@microsoft.com Abhishek Shetty Cornell University shetty@cs.cornell.edu Navin Goyal Microsoft Research India navingo@microsoft.com |
| Pseudocode | No | The paper contains mathematical derivations, proofs, and theoretical analyses, but no explicit pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not contain any statement about releasing source code or a link to a code repository. |
| Open Datasets | Yes | Real data. We consider a random subset of 10^4 images from the CIFAR10 dataset (Krizhevsky & Hinton, 2009). |
| Dataset Splits | No | The paper mentions using a 'random subset of 10^4 images from CIFAR10 dataset' for training a 2-layer network and verifying an assumption on data samples, but it does not specify any train/validation/test splits or cross-validation methodology. It does not provide sufficient detail to reproduce the data partitioning. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU model, CPU type, memory specifications) used for running the experiments. It only generally refers to training neural networks. |
| Software Dependencies | No | The paper does not provide specific software dependencies with version numbers (e.g., Python 3.x, PyTorch 1.x). It does not mention any libraries or frameworks used in the experiments. |
| Experiment Setup | Yes | We train a 2-layer neural network in the DZPS setting with mean squared loss, containing 10^6 neurons in the first layer with activations tanh, ReLU, swish and ELU at learning rate 10^-3. ... We observed a difference in the rate of convergence while training a 2-layer network, with both layers trainable, using stochastic gradient descent (SGD) with batch size 256 and cross-entropy loss on the random subset of the CIFAR10 dataset at l.r. 10^-3. (Minimal code sketches of these two setups follow the table.) |
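
For concreteness, here is a minimal PyTorch sketch of the synthetic-data setup quoted above: a 2-layer network in the DZPS setting with a fixed (untrained) output layer, squared loss, and full-batch gradient descent at learning rate 10^-3 over the four activations. The width, epoch count, initialization scales, and the construction of the random lift from S^1 to S^9 are assumptions where the quoted text leaves details open; this is a sketch, not the authors' code.

```python
import math
import torch
import torch.nn as nn

torch.manual_seed(0)

n, d = 100, 10          # n data points; S^9 lives in R^10
width = 10_000          # the paper uses 10^6 neurons; shrunk here to keep the sketch light

# n equally spaced points on S^1, lifted to S^9 via a random orthonormal map (assumed construction)
theta = torch.linspace(0.0, 2 * math.pi, n + 1)[:-1]
circle = torch.stack([torch.cos(theta), torch.sin(theta)], dim=1)   # (n, 2), unit norm
lift, _ = torch.linalg.qr(torch.randn(d, 2))                        # random 10x2 matrix with orthonormal columns
X = circle @ lift.T                                                 # (n, 10), still unit norm, i.e. on S^9
y = (torch.randint(0, 2, (n,)) * 2 - 1).float()                     # labels drawn uniformly from {-1, +1}


def make_two_layer(act):
    """2-layer net in the DZPS-style setting: only the first-layer weights are trained."""
    W = nn.Linear(d, width, bias=False)                                      # trainable first layer (default init; scale assumed)
    a = (torch.randint(0, 2, (width,)) * 2 - 1).float() / math.sqrt(width)   # fixed random output weights (assumed scale)
    return W, lambda x: act(W(x)) @ a


activations = {"tanh": torch.tanh, "relu": torch.relu, "swish": nn.SiLU(), "elu": nn.ELU()}

for name, act in activations.items():
    W, f = make_two_layer(act)
    opt = torch.optim.SGD(W.parameters(), lr=1e-3)                  # learning rate 10^-3 as quoted
    for epoch in range(1000):                                       # epoch count is an assumption
        opt.zero_grad()
        loss = ((f(X) - y) ** 2).mean()                             # mean squared loss
        loss.backward()
        opt.step()
    print(f"{name}: final squared loss {loss.item():.5f}")
```

The paper's Figure 1 plots this squared loss against epochs, averaged over 5 runs; the averaging over runs is omitted here.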
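
A similarly hedged sketch of the real-data run: a random 10^4-image subset of CIFAR-10, a 2-layer network with both layers trainable, SGD with batch size 256, cross-entropy loss, and learning rate 10^-3. The width is shrunk from the quoted 10^5 first-layer neurons, and the preprocessing (plain `ToTensor`) and epoch count are assumptions, consistent with the "Dataset Splits" and "Software Dependencies" rows noting that the paper does not pin these down.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, Subset
from torchvision import datasets, transforms

torch.manual_seed(0)

full = datasets.CIFAR10(root="./data", train=True, download=True,
                        transform=transforms.ToTensor())
subset = Subset(full, torch.randperm(len(full))[:10_000].tolist())   # random 10^4-image subset
loader = DataLoader(subset, batch_size=256, shuffle=True)            # batch size 256 as quoted

width = 4096                                                         # paper: 10^5 first-layer neurons
model = nn.Sequential(nn.Flatten(),
                      nn.Linear(3 * 32 * 32, width),
                      nn.Tanh(),                                     # swap in nn.ReLU(), nn.SiLU(), nn.ELU() to compare
                      nn.Linear(width, 10))                          # both layers trainable
opt = torch.optim.SGD(model.parameters(), lr=1e-3)                   # l.r. 10^-3 as quoted
criterion = nn.CrossEntropyLoss()

for epoch in range(10):                                              # epoch count is an assumption
    for x, t in loader:
        opt.zero_grad()
        loss = criterion(model(x), t)
        loss.backward()
        opt.step()
    print(f"epoch {epoch}: last-batch loss {loss.item():.4f}")
```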