On the Training Instability of Shuffling SGD with Batch Normalization

Authors: David Xing Wu, Chulhee Yun, Suvrit Sra

ICML 2023

Reproducibility assessment: each variable below is listed with its result, followed by the LLM response quoted or summarized as evidence.
Research Type: Experimental
  "We validate our results empirically in realistic settings, and conclude that the separation between SS and RR used with batch normalization is relevant in practice." "Our experiments (Fig. 1) suggest that combining SS and BN can lead to surprising and undesirable training phenomena." (A sketch contrasting the SS and RR shuffling schemes appears after this table.)

Researcher Affiliation: Collaboration
  "(1) Department of EECS, UC Berkeley, Berkeley, CA, USA; (2) Kim Jaechul Graduate School of AI, KAIST, Seoul, Korea; (3) Department of EECS, LIDS, MIT, Cambridge, MA, USA." "CY acknowledges support from a grant funded by Samsung Electronics Co., Ltd."
Pseudocode: No
  The paper describes its methods in prose and mathematical derivations but does not include any explicitly labeled pseudocode or algorithm blocks.

Open Source Code: Yes
  "see https://github.com/davidxwu/sgd-batchnorm-icml for the experiment code."
Open Datasets: Yes
  "To exhibit the above divergence on real data, we conducted experiments on the CIFAR10 dataset." "For the nonlinear experiments, we extended to the CIFAR10, MNIST, and CIFAR100 datasets."

Dataset Splits: No
  "We trained linear+BN networks of depths up to 3 for T = 10^3 epochs using stepsize η = 10^-2, batch size B = 128, and 512 hidden units per layer (see Appendix D for precise details)." The paper names the datasets it uses but does not explicitly provide percentages, sample counts, or a method for train/validation/test splits.

Hardware Specification: No
  No specific hardware (e.g., GPU/CPU models, processor types, memory amounts) used for the experiments is explicitly mentioned in the paper.
Software Dependencies: Yes
  "All experiments were implemented in PyTorch 1.12."

Experiment Setup: Yes
  "We trained linear+BN networks of depths up to 3 for T = 10^3 epochs using stepsize η = 10^-2, batch size B = 128, and 512 hidden units per layer (see Appendix D for precise details)." "The BN layers were instantiated with track_running_stats=False." (A minimal PyTorch sketch of this setup follows the table.)
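
The separation the paper studies is between two data-ordering schemes: single shuffle (SS), which fixes one random permutation of the training set and reuses it every epoch, and random reshuffling (RR), which draws a fresh permutation each epoch. The following is a minimal sketch of the two loops; the names (n, n_epochs, B, batch_idx) are illustrative placeholders, not identifiers from the paper's released code.

```python
import torch

# Placeholder sizes for illustration only.
n, n_epochs, B = 1024, 5, 128

# Single shuffle (SS): one random permutation, reused every epoch.
ss_order = torch.randperm(n)
for epoch in range(n_epochs):
    for batch_idx in ss_order.split(B):
        pass  # forward/backward pass on the batch selected by batch_idx

# Random reshuffling (RR): a fresh permutation at the start of each epoch.
for epoch in range(n_epochs):
    rr_order = torch.randperm(n)
    for batch_idx in rr_order.split(B):
        pass  # forward/backward pass on the batch selected by batch_idx
```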
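And a minimal sketch of the quoted linear+BN setup, assuming standard torchvision CIFAR10 loading with plain cross-entropy training; the function name linear_bn_net and the preprocessing are assumptions here, and the exact configuration lives in the paper's Appendix D and its released code.

```python
import torch
import torch.nn as nn
from torchvision import datasets, transforms

def linear_bn_net(depth: int = 3, width: int = 512,
                  in_dim: int = 3 * 32 * 32, out_dim: int = 10) -> nn.Sequential:
    """Linear+BN network: no nonlinearities, BN after each linear layer."""
    layers, d = [nn.Flatten()], in_dim
    for _ in range(depth):
        layers += [nn.Linear(d, width),
                   nn.BatchNorm1d(width, track_running_stats=False)]
        d = width
    layers.append(nn.Linear(d, out_dim))
    return nn.Sequential(*layers)

# Assumed preprocessing: plain ToTensor(); see the paper's Appendix D for details.
train_set = datasets.CIFAR10(root="./data", train=True, download=True,
                             transform=transforms.ToTensor())
# shuffle=True reshuffles every epoch (i.e., RR); a fixed sampler would give SS.
loader = torch.utils.data.DataLoader(train_set, batch_size=128, shuffle=True)

model = linear_bn_net()
opt = torch.optim.SGD(model.parameters(), lr=1e-2)  # stepsize eta = 10^-2
loss_fn = nn.CrossEntropyLoss()

for epoch in range(1):  # the paper trains for T = 10^3 epochs
    for x, y in loader:
        opt.zero_grad()
        loss_fn(model(x), y).backward()
        opt.step()
```

With track_running_stats=False, the BN layers keep no running mean/variance and normalize with batch statistics in both train and eval modes, matching the instantiation quoted above.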