Towards Training Without Depth Limits: Batch Normalization Without Gradient Explosion

Authors: Alexandru Meterez, Amir Joudaki, Francesco Orabona, Alexander Immer, Gunnar Rätsch, Hadi Daneshmand

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We experimentally show that it is possible to avoid gradient explosion for certain non-linear activations with orthogonal random weights together with activation shaping (Martens et al., 2021). Finally, we experimentally demonstrate that avoiding gradient explosion stabilizes the training of deep MLPs with BN."
Researcher Affiliation | Academia | Alexandru Meterez (D-INFK, ETH Zürich, ameterez@ethz.ch); Amir Joudaki (D-INFK, ETH Zürich, ajoudaki@ethz.ch); Francesco Orabona (CEMSE, KAUST, francesco@orabona.com); Alexander Immer (D-INFK, ETH Zürich, aimmer@ethz.ch); Gunnar Rätsch (D-INFK, ETH Zürich, raetsch@inf.ethz.ch); Hadi Daneshmand (MIT LIDS / Boston University, hdanesh@mit.edu)
Pseudocode | No | The paper does not contain any clearly labeled 'Pseudocode' or 'Algorithm' blocks.
Open Source Code | Yes | Code is available at: https://github.com/alexandrumeterez/bngrad
Open Datasets | Yes | "For CIFAR10, CIFAR100, MNIST and Fashion MNIST, we empirically tested that most batches across various batch sizes are full-rank (see Section E for details on the average rank of a batch in these datasets)." (A rank-check sketch follows the table.)
Dataset Splits | No | The paper uses common datasets such as CIFAR10 and MNIST and reports hyperparameters such as a batch size of 100, but it does not specify training, validation, and test splits (as percentages or sample counts) or refer to standard predefined splits.
Hardware Specification | No | The paper does not mention specific hardware such as GPU models (e.g., NVIDIA A100, RTX), CPU models (e.g., Intel Xeon), or cloud instance types used to run the experiments.
Software Dependencies | No | The paper mentions PyTorch but does not give version numbers for the software dependencies or libraries needed to reproduce the experiments.
Experiment Setup | Yes | "The networks are trained with vanilla SGD and the hyperparameters are width 100, batch size 100, learning rate 0.001." (A minimal sketch of this setup follows the table.)
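
The Research Type and Experiment Setup rows above describe the core recipe: deep MLPs with batch normalization, orthogonally initialized weights, and vanilla SGD (width 100, batch size 100, learning rate 0.001). The following is a minimal PyTorch sketch of that kind of setup, not the authors' code (their repository is linked above); the depth, the Tanh activation, and the CIFAR10-shaped random inputs are illustrative assumptions, and the activation shaping of Martens et al. (2021) is omitted.

```python
# Minimal sketch (not the authors' code) of a deep BN-MLP with orthogonal
# random weights, trained with vanilla SGD as quoted in the table:
# width 100, batch size 100, learning rate 0.001.
import torch
import torch.nn as nn


def make_bn_mlp(in_dim, width=100, depth=20, out_dim=10):
    """Deep MLP: Linear (orthogonal init) -> BatchNorm -> activation, repeated."""
    layers = []
    dims = [in_dim] + [width] * depth
    for d_in, d_out in zip(dims[:-1], dims[1:]):
        linear = nn.Linear(d_in, d_out, bias=False)
        nn.init.orthogonal_(linear.weight)   # orthogonal random weights
        # Tanh is an illustrative choice; the paper additionally shapes the
        # activation (Martens et al., 2021), which is omitted here.
        layers += [linear, nn.BatchNorm1d(d_out), nn.Tanh()]
    layers.append(nn.Linear(width, out_dim))
    return nn.Sequential(*layers)


model = make_bn_mlp(in_dim=3 * 32 * 32)                    # CIFAR10-sized inputs
optimizer = torch.optim.SGD(model.parameters(), lr=0.001)  # vanilla SGD
criterion = nn.CrossEntropyLoss()

# One training step on a batch of size 100 (random tensors stand in for real data).
x = torch.randn(100, 3 * 32 * 32)
y = torch.randint(0, 10, (100,))
optimizer.zero_grad()
loss = criterion(model(x), y)
loss.backward()
optimizer.step()
```

The key ingredient is nn.init.orthogonal_, which replaces PyTorch's default Kaiming-uniform initialization of nn.Linear; the rest is a plain BN-MLP trained with unmodified SGD.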
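
The Open Datasets row cites an empirical check that most batches in CIFAR10, CIFAR100, MNIST and Fashion MNIST are full-rank (the paper's Section E). A hypothetical version of such a check, assuming a torchvision CIFAR10 loader rather than the authors' exact procedure, could look like this:

```python
# Hypothetical rank check (not the paper's Section E code): estimate the
# average numerical rank of flattened CIFAR10 batches of size 100.
import torch
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

batch_size = 100
loader = DataLoader(
    datasets.CIFAR10(root="./data", train=True, download=True,
                     transform=transforms.ToTensor()),
    batch_size=batch_size,
    shuffle=True,
)

ranks = []
for i, (x, _) in enumerate(loader):
    if i >= 10:                              # a few batches suffice for an estimate
        break
    flat = x.view(x.size(0), -1)             # shape: (batch_size, 3*32*32)
    ranks.append(torch.linalg.matrix_rank(flat).item())

# A batch is full-rank when its rank equals the batch size.
print(f"average rank over {len(ranks)} batches of size {batch_size}: "
      f"{sum(ranks) / len(ranks):.1f}")
```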