On the Training Instability of Shuffling SGD with Batch Normalization
Authors: David Xing Wu, Chulhee Yun, Suvrit Sra
ICML 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We validate our results empirically in realistic settings, and conclude that the separation between SS and RR used with batch normalization is relevant in practice. Our experiments (Fig. 1) suggest that combining SS and BN can lead to surprising and undesirable training phenomena. |
| Researcher Affiliation | Collaboration | 1) Department of EECS, UC Berkeley, Berkeley, CA, USA; 2) Kim Jaechul Graduate School of AI, KAIST, Seoul, Korea; 3) Department of EECS, LIDS, MIT, Cambridge, MA, USA. CY acknowledges support from a grant funded by Samsung Electronics Co., Ltd. |
| Pseudocode | No | The paper describes methods in prose and mathematical derivations but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | see https://github.com/davidxwu/sgd-batchnorm-icml for the experiment code. |
| Open Datasets | Yes | To exhibit the above divergence on real data, we conducted experiments on the CIFAR10 dataset. For the nonlinear experiments, we extended to the CIFAR10, MNIST, and CIFAR100 datasets. |
| Dataset Splits | No | We trained linear+BN networks of depths up to 3 for T = 10^3 epochs using stepsize η = 10^-2, batch size B = 128, and 512 hidden units per layer (see Appendix D for precise details). The paper mentions using these datasets but does not explicitly provide percentages, sample counts, or specific methods for train/validation/test splits. |
| Hardware Specification | No | No specific hardware components (e.g., GPU/CPU models, processor types, memory amounts) used for the experiments are explicitly mentioned in the paper. |
| Software Dependencies | Yes | All experiments were implemented in PyTorch 1.12. |
| Experiment Setup | Yes | We trained linear+BN networks of depths up to 3 for T = 10^3 epochs using stepsize η = 10^-2, batch size B = 128, and 512 hidden units per layer (see Appendix D for precise details). The BN layers were instantiated with track_running_stats=False. A hedged sketch of this setup appears below the table. |
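
For concreteness, below is a minimal PyTorch sketch of the configuration quoted above: a depth-3 linear+BN network with 512 hidden units, `BatchNorm1d` layers with `track_running_stats=False`, plain SGD with stepsize 10^-2 and batch size 128 on CIFAR10, trained under either single-shuffle (SS, one fixed permutation reused every epoch) or random-reshuffling (RR, a fresh permutation each epoch). This is an illustration assembled from the reported hyperparameters, not the authors' released code (see the GitHub link in the table); the helper `make_linear_bn_net` and the exact loop structure are assumptions.

```python
# Hypothetical sketch of the reported setup: depth-3 linear+BN network on CIFAR10,
# trained with single-shuffle (SS) or random-reshuffling (RR) SGD.
# Only the hyperparameters (depth <= 3, 512 hidden units, B = 128, lr = 1e-2,
# track_running_stats=False, T = 10^3 epochs) come from the paper's description.
import torch
import torch.nn as nn
import torchvision
import torchvision.transforms as T


def make_linear_bn_net(in_dim=3 * 32 * 32, hidden=512, depth=3, num_classes=10):
    # Linear layers interleaved with BatchNorm1d; track_running_stats=False means
    # BN always normalizes with the statistics of the current mini-batch.
    layers, d = [], in_dim
    for _ in range(depth - 1):
        layers += [nn.Linear(d, hidden),
                   nn.BatchNorm1d(hidden, track_running_stats=False)]
        d = hidden
    layers += [nn.Linear(d, num_classes)]
    return nn.Sequential(nn.Flatten(), *layers)


train_set = torchvision.datasets.CIFAR10(
    root="./data", train=True, download=True, transform=T.ToTensor()
)

model = make_linear_bn_net()
opt = torch.optim.SGD(model.parameters(), lr=1e-2)   # stepsize eta = 10^-2
loss_fn = nn.CrossEntropyLoss()
B, epochs = 128, 1000                                # batch size 128, T = 10^3 epochs

shuffling = "SS"  # "SS": reuse one fixed shuffle every epoch; "RR": reshuffle each epoch
fixed_perm = torch.randperm(len(train_set))

for epoch in range(epochs):
    perm = fixed_perm if shuffling == "SS" else torch.randperm(len(train_set))
    for start in range(0, len(train_set) - B + 1, B):  # drop the last partial batch
        idx = perm[start:start + B].tolist()
        x = torch.stack([train_set[i][0] for i in idx])
        y = torch.tensor([train_set[i][1] for i in idx])
        opt.zero_grad()
        loss_fn(model(x), y).backward()
        opt.step()
```

In this sketch the only difference between the SS and RR runs is the single line selecting the permutation each epoch, which mirrors how the paper isolates the effect of shuffling order when combined with batch normalization.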