Towards Training Without Depth Limits: Batch Normalization Without Gradient Explosion
Authors: Alexandru Meterez, Amir Joudaki, Francesco Orabona, Alexander Immer, Gunnar Rätsch, Hadi Daneshmand
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We experimentally show that it is possible to avoid gradient explosion for certain non-linear activations with orthogonal random weights together with activation shaping (Martens et al., 2021). Finally, we experimentally demonstrate that avoiding gradient explosion stabilizes the training of deep MLPs with BN. |
| Researcher Affiliation | Academia | Alexandru Meterez, D-INFK, ETH Zürich (ameterez@ethz.ch); Amir Joudaki, D-INFK, ETH Zürich (ajoudaki@ethz.ch); Francesco Orabona, CEMSE, KAUST (francesco@orabona.com); Alexander Immer, D-INFK, ETH Zürich (aimmer@ethz.ch); Gunnar Rätsch, D-INFK, ETH Zürich (raetsch@inf.ethz.ch); Hadi Daneshmand, MIT LIDS / Boston University (hdanesh@mit.edu) |
| Pseudocode | No | The paper does not contain any clearly labeled 'Pseudocode' or 'Algorithm' blocks. |
| Open Source Code | Yes | Code is available at: https://github.com/alexandrumeterez/bngrad |
| Open Datasets | Yes | For CIFAR10, CIFAR100, MNIST and Fashion MNIST, we empirically tested that most batches across various batch sizes are full-rank (see Section E for details on the average rank of a batch in these datasets). (A minimal rank-check sketch is given after this table.) |
| Dataset Splits | No | The paper mentions using common datasets like CIFAR10 and MNIST and hyperparameters like 'batch size 100' but does not specify the training, validation, and test dataset splits (e.g., percentages or sample counts) or refer to standard predefined splits. |
| Hardware Specification | No | The paper does not explicitly mention any specific hardware details such as GPU models (e.g., NVIDIA A100, RTX), CPU models (e.g., Intel Xeon), or cloud instance types used for running experiments. |
| Software Dependencies | No | The paper mentions 'PyTorch' but does not provide specific version numbers for software dependencies or libraries needed to reproduce the experiment. |
| Experiment Setup | Yes | The networks are trained with vanilla SGD and the hyperparameters are width 100, batch size 100, learning rate 0.001. (A minimal training sketch with these hyperparameters is given after this table.) |
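
The full-rank claim quoted in the Open Datasets row can be checked with a few lines of PyTorch. The sketch below is not the authors' script: it samples CIFAR10 batches of size 100 (the batch size reported in the paper), flattens each one, and counts how many batch matrices are full-rank. The dataset path, the number of sampled batches, and the use of the default rank tolerance are illustrative assumptions.

```python
# Hedged sketch: estimate how often a random CIFAR10 batch is full-rank when flattened.
# "./data", num_batches, and the default rank tolerance are assumptions for illustration.
import torch
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

batch_size = 100  # matches the batch size reported in the paper

dataset = datasets.CIFAR10(root="./data", train=True, download=True,
                           transform=transforms.ToTensor())
loader = DataLoader(dataset, batch_size=batch_size, shuffle=True, drop_last=True)

full_rank_batches = 0
num_batches = 50  # small sample of batches, for illustration only
for i, (x, _) in enumerate(loader):
    if i >= num_batches:
        break
    flat = x.view(x.size(0), -1)           # batch matrix of shape (100, 3*32*32)
    rank = torch.linalg.matrix_rank(flat)  # numerical rank of the batch matrix
    full_rank_batches += int(rank == batch_size)

print(f"{full_rank_batches}/{num_batches} sampled batches were full-rank")
```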
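Similarly, the Experiment Setup row fixes only the optimizer (vanilla SGD), the width (100), the batch size (100), and the learning rate (0.001). The sketch below trains a deep batch-normalized MLP with those values; the depth, dataset (MNIST), single training epoch, ReLU activation, and exact layer ordering are assumptions for illustration, and the orthogonal weight initialization mirrors the setting quoted in the Research Type row rather than the authors' exact code.

```python
# Hedged sketch: deep MLP with BatchNorm trained by vanilla SGD.
# Only width, batch size, learning rate, and the optimizer come from the quoted setup;
# depth, dataset, activation, and epoch count are illustrative assumptions.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

width, batch_size, lr, depth = 100, 100, 1e-3, 20  # depth is an assumption

layers = [nn.Flatten(), nn.Linear(28 * 28, width)]
for _ in range(depth):
    layers += [nn.BatchNorm1d(width), nn.ReLU(), nn.Linear(width, width)]
layers += [nn.BatchNorm1d(width), nn.ReLU(), nn.Linear(width, 10)]
model = nn.Sequential(*layers)

# Orthogonal random weights, following the setting mentioned in the paper.
for m in model.modules():
    if isinstance(m, nn.Linear):
        nn.init.orthogonal_(m.weight)

train_set = datasets.MNIST(root="./data", train=True, download=True,
                           transform=transforms.ToTensor())
loader = DataLoader(train_set, batch_size=batch_size, shuffle=True, drop_last=True)

optimizer = torch.optim.SGD(model.parameters(), lr=lr)  # vanilla SGD, no momentum
criterion = nn.CrossEntropyLoss()

model.train()
for x, y in loader:  # one pass over the data, for illustration
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()
```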