Towards Training Without Depth Limits: Batch Normalization Without Gradient Explosion
Authors: Alexandru Meterez, Amir Joudaki, Francesco Orabona, Alexander Immer, Gunnar Rätsch, Hadi Daneshmand
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We experimentally show that it is possible to avoid gradient explosion for certain non-linear activations with orthogonal random weights together with activation shaping (Martens et al., 2021). Finally, we experimentally demonstrate that avoiding gradient explosion stabilizes the training of deep MLPs with BN. |
| Researcher Affiliation | Academia | Alexandru Meterez, D-INFK, ETH Zürich (ameterez@ethz.ch); Amir Joudaki, D-INFK, ETH Zürich (ajoudaki@ethz.ch); Francesco Orabona, CEMSE, KAUST (francesco@orabona.com); Alexander Immer, D-INFK, ETH Zürich (aimmer@ethz.ch); Gunnar Rätsch, D-INFK, ETH Zürich (raetsch@inf.ethz.ch); Hadi Daneshmand, MIT LIDS / Boston University (hdanesh@mit.edu) |
| Pseudocode | No | The paper does not contain any clearly labeled 'Pseudocode' or 'Algorithm' blocks. |
| Open Source Code | Yes | Code is available at: https://github.com/alexandrumeterez/bngrad |
| Open Datasets | Yes | For CIFAR10, CIFAR100, MNIST and Fashion MNIST, we empirically tested that most batches across various batch sizes are full-rank (see Section E for details on the average rank of a batch in these datasets). (A minimal rank-check sketch is given after this table.) |
| Dataset Splits | No | The paper mentions using common datasets like CIFAR10 and MNIST and hyperparameters like 'batch size 100' but does not specify the training, validation, and test dataset splits (e.g., percentages or sample counts) or refer to standard predefined splits. |
| Hardware Specification | No | The paper does not explicitly mention any specific hardware details such as GPU models (e.g., NVIDIA A100, RTX), CPU models (e.g., Intel Xeon), or cloud instance types used for running experiments. |
| Software Dependencies | No | The paper mentions 'PyTorch' but does not provide specific version numbers for software dependencies or libraries needed to reproduce the experiment. |
| Experiment Setup | Yes | The networks are trained with vanilla SGD and the hyperparameters are width 100, batch size 100, learning rate 0.001. (A minimal training sketch with these hyperparameters is given after this table.) |
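
The full-rank claim quoted in the Open Datasets row can be checked with a few lines of PyTorch. The sketch below is not the authors' script: it samples CIFAR10 batches of size 100 (the batch size reported in the paper), flattens each one, and counts how many batch matrices are full-rank. The dataset path, the number of sampled batches, and the use of the default rank tolerance are illustrative assumptions.

```python
# Hedged sketch: estimate how often a random CIFAR10 batch is full-rank when flattened.
# "./data", num_batches, and the default rank tolerance are assumptions for illustration.
import torch
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

batch_size = 100  # matches the batch size reported in the paper

dataset = datasets.CIFAR10(root="./data", train=True, download=True,
                           transform=transforms.ToTensor())
loader = DataLoader(dataset, batch_size=batch_size, shuffle=True, drop_last=True)

full_rank_batches = 0
num_batches = 50  # small sample of batches, for illustration only
for i, (x, _) in enumerate(loader):
    if i >= num_batches:
        break
    flat = x.view(x.size(0), -1)           # batch matrix of shape (100, 3*32*32)
    rank = torch.linalg.matrix_rank(flat)  # numerical rank of the batch matrix
    full_rank_batches += int(rank == batch_size)

print(f"{full_rank_batches}/{num_batches} sampled batches were full-rank")
```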
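Similarly, the Experiment Setup row fixes only the optimizer (vanilla SGD), the width (100), the batch size (100), and the learning rate (0.001). The sketch below trains a deep batch-normalized MLP with those values; the depth, dataset (MNIST), single training epoch, ReLU activation, and exact layer ordering are assumptions for illustration, and the orthogonal weight initialization mirrors the setting quoted in the Research Type row rather than the authors' exact code.

```python
# Hedged sketch: deep MLP with BatchNorm trained by vanilla SGD.
# Only width, batch size, learning rate, and the optimizer come from the quoted setup;
# depth, dataset, activation, and epoch count are illustrative assumptions.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

width, batch_size, lr, depth = 100, 100, 1e-3, 20  # depth is an assumption

layers = [nn.Flatten(), nn.Linear(28 * 28, width)]
for _ in range(depth):
    layers += [nn.BatchNorm1d(width), nn.ReLU(), nn.Linear(width, width)]
layers += [nn.BatchNorm1d(width), nn.ReLU(), nn.Linear(width, 10)]
model = nn.Sequential(*layers)

# Orthogonal random weights, following the setting mentioned in the paper.
for m in model.modules():
    if isinstance(m, nn.Linear):
        nn.init.orthogonal_(m.weight)

train_set = datasets.MNIST(root="./data", train=True, download=True,
                           transform=transforms.ToTensor())
loader = DataLoader(train_set, batch_size=batch_size, shuffle=True, drop_last=True)

optimizer = torch.optim.SGD(model.parameters(), lr=lr)  # vanilla SGD, no momentum
criterion = nn.CrossEntropyLoss()

model.train()
for x, y in loader:  # one pass over the data, for illustration
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()
```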