Expected Gradients of Maxout Networks and Consequences to Parameter Initialization

Authors: Hanna Tseran, Guido Montúfar

ICML 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments with deep fully-connected and convolutional networks show that this strategy improves SGD and Adam training of deep maxout networks. In addition, we obtain refined bounds on the expected number of linear regions, results on the expected curve length distortion, and results on the NTK.
Researcher Affiliation | Academia | Max Planck Institute for Mathematics in the Sciences, 04103 Leipzig, Germany; Department of Mathematics and Department of Statistics, UCLA, Los Angeles, CA 90095, USA. Correspondence to: Hanna Tseran <hanna.tseran@mis.mpg.de>.
Pseudocode | No | The paper does not contain any pseudocode or clearly labeled algorithm blocks.
Open Source Code | Yes | The code is available at https://github.com/hanna-tseran/maxout_expected_gradients.
Open Datasets | Yes | We use MNIST (LeCun & Cortes, 2010), Iris (Fisher, 1936), Fashion-MNIST (Xiao et al., 2017), SVHN (Netzer et al., 2011), and the CIFAR-10 and CIFAR-100 datasets (Krizhevsky et al., 2009).
Dataset Splits | Yes | We split the data into training, validation, and test sets and report accuracy on the test set; the validation set was used only for picking the hyper-parameters and was not used in training. (A minimal sketch of this split protocol follows the table.)
Hardware Specification | Yes | The most extensive experiments ran for one day on a single GPU. The experiment in Figure 2 was run on a CPU cluster with Intel Xeon Ice Lake SP (Platinum 8360Y) processors, 72 cores per node, and 256 GB RAM. All other experiments were executed on a ThinkPad T470 laptop with an Intel Core i5-7200U CPU and 16 GB RAM.
Software Dependencies | No | Experiments were implemented in Python using TensorFlow (Martín Abadi et al., 2015), numpy (Harris et al., 2020), and mpi4py (Dalcin et al., 2011). The plots were created using matplotlib (Hunter, 2007).
Experiment Setup | Yes | The mini-batch size in all experiments is 32. The number of training epochs was chosen by observing the training loss and picking the point at which it had converged; the exception is the SVHN dataset, for which we observe the double-descent phenomenon and stop training after 150 epochs. ... We use learning rate decay and choose the optimal initial learning rate for each network and initialization type by grid search over accuracy on the validation set. The learning rate was halved every nth epoch; for SVHN, n = 10, and for all other datasets, n = 100. (A sketch of this schedule follows the table.)
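
The Dataset Splits row describes a standard protocol: hold out a validation set from the official training data, tune hyper-parameters on it only, and report accuracy on the untouched test set. Below is a minimal Python sketch of that protocol; the use of tf.keras.datasets, the 10,000-example validation size, and the fixed random seed are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch of the train/validation/test protocol quoted above.
# The split size and data source are assumptions, not the paper's exact values.
import numpy as np
import tensorflow as tf

(x_train_full, y_train_full), (x_test, y_test) = tf.keras.datasets.mnist.load_data()

# Hold out part of the official training set as a validation set.
rng = np.random.default_rng(0)        # assumed seed for reproducibility
perm = rng.permutation(len(x_train_full))
n_val = 10_000                        # assumed validation size
val_idx, train_idx = perm[:n_val], perm[n_val:]

x_train, y_train = x_train_full[train_idx], y_train_full[train_idx]
x_val, y_val = x_train_full[val_idx], y_train_full[val_idx]

# The validation set is used only to pick hyper-parameters (e.g. the initial
# learning rate); the test set is touched once, to report final accuracy.
```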
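
The Experiment Setup row translates directly into a step-decay training loop: mini-batches of 32 and a learning rate halved every n-th epoch, with the initial learning rate picked by grid search on validation accuracy. The sketch below shows one way to express this with tf.keras; the candidate learning rates, the SGD optimizer choice, and the build_model placeholder are assumptions for illustration, not the paper's exact configuration.

```python
# Sketch of the quoted training schedule: batch size 32 and a step decay that
# halves the learning rate every n-th epoch (n = 10 for SVHN, n = 100 otherwise).
import tensorflow as tf

def halving_schedule(initial_lr, n):
    # Halve the learning rate every n epochs (epochs are 0-indexed in Keras).
    def schedule(epoch, lr):
        return initial_lr * 0.5 ** (epoch // n)
    return schedule

def train_and_validate(build_model, x_train, y_train, x_val, y_val,
                       initial_lr, n, epochs):
    # build_model is a placeholder for any maxout architecture.
    model = build_model()
    model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=initial_lr),
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    cb = tf.keras.callbacks.LearningRateScheduler(halving_schedule(initial_lr, n))
    model.fit(x_train, y_train, batch_size=32, epochs=epochs,
              validation_data=(x_val, y_val), callbacks=[cb], verbose=0)
    return model.evaluate(x_val, y_val, verbose=0)[1]  # validation accuracy

# Grid search over assumed candidate initial learning rates; the best one is
# chosen by validation accuracy, as the quoted setup describes, e.g.:
# best_lr = max([1e-3, 1e-2, 1e-1],
#               key=lambda lr: train_and_validate(build_model,
#                                                 x_train, y_train, x_val, y_val,
#                                                 initial_lr=lr, n=100, epochs=200))
```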