On the Training Instability of Shuffling SGD with Batch Normalization
Authors: David Xing Wu, Chulhee Yun, Suvrit Sra
ICML 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We validate our results empirically in realistic settings, and conclude that the separation between SS and RR used with batch normalization is relevant in practice. Our experiments (Fig. 1) suggest that combining SS and BN can lead to surprising and undesirable training phenomena. |
| Researcher Affiliation | Collaboration | 1) Department of EECS, UC Berkeley, Berkeley, CA, USA; 2) Kim Jaechul Graduate School of AI, KAIST, Seoul, Korea; 3) Department of EECS, LIDS, MIT, Cambridge, MA, USA. CY acknowledges support from a grant funded by Samsung Electronics Co., Ltd. |
| Pseudocode | No | The paper describes methods in prose and mathematical derivations but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | see https://github.com/davidxwu/sgd-batchnorm-icml for the experiment code. |
| Open Datasets | Yes | To exhibit the above divergence on real data, we conducted experiments on the CIFAR10 dataset. For the nonlinear experiments, we extended to the CIFAR10, MNIST, and CIFAR100 datasets. |
| Dataset Splits | No | We trained linear+BN networks of depths up to 3 for T = 10^3 epochs using stepsize η = 10^-2, batch size B = 128, and 512 hidden units per layer (see Appendix D for precise details). The paper mentions using these datasets but does not explicitly provide percentages, sample counts, or specific methods for train/validation/test splits. |
| Hardware Specification | No | No specific hardware components (e.g., GPU/CPU models, processor types, memory amounts) used for the experiments are explicitly mentioned in the paper. |
| Software Dependencies | Yes | All experiments were implemented in PyTorch 1.12. |
| Experiment Setup | Yes | We trained linear+BN networks of depths up to 3 for T = 10^3 epochs using stepsize η = 10^-2, batch size B = 128, and 512 hidden units per layer (see Appendix D for precise details). The BN layers were instantiated with track_running_stats=False. A hedged sketch of this setup appears below the table. |
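
For concreteness, below is a minimal PyTorch sketch of the configuration quoted above: a depth-3 linear+BN network with 512 hidden units, `BatchNorm1d` layers with `track_running_stats=False`, plain SGD with stepsize 10^-2 and batch size 128 on CIFAR10, trained under either single-shuffle (SS, one fixed permutation reused every epoch) or random-reshuffling (RR, a fresh permutation each epoch). This is an illustration assembled from the reported hyperparameters, not the authors' released code (see the GitHub link in the table); the helper `make_linear_bn_net` and the exact loop structure are assumptions.

```python
# Hypothetical sketch of the reported setup: depth-3 linear+BN network on CIFAR10,
# trained with single-shuffle (SS) or random-reshuffling (RR) SGD.
# Only the hyperparameters (depth <= 3, 512 hidden units, B = 128, lr = 1e-2,
# track_running_stats=False, T = 10^3 epochs) come from the paper's description.
import torch
import torch.nn as nn
import torchvision
import torchvision.transforms as T


def make_linear_bn_net(in_dim=3 * 32 * 32, hidden=512, depth=3, num_classes=10):
    # Linear layers interleaved with BatchNorm1d; track_running_stats=False means
    # BN always normalizes with the statistics of the current mini-batch.
    layers, d = [], in_dim
    for _ in range(depth - 1):
        layers += [nn.Linear(d, hidden),
                   nn.BatchNorm1d(hidden, track_running_stats=False)]
        d = hidden
    layers += [nn.Linear(d, num_classes)]
    return nn.Sequential(nn.Flatten(), *layers)


train_set = torchvision.datasets.CIFAR10(
    root="./data", train=True, download=True, transform=T.ToTensor()
)

model = make_linear_bn_net()
opt = torch.optim.SGD(model.parameters(), lr=1e-2)   # stepsize eta = 10^-2
loss_fn = nn.CrossEntropyLoss()
B, epochs = 128, 1000                                # batch size 128, T = 10^3 epochs

shuffling = "SS"  # "SS": reuse one fixed shuffle every epoch; "RR": reshuffle each epoch
fixed_perm = torch.randperm(len(train_set))

for epoch in range(epochs):
    perm = fixed_perm if shuffling == "SS" else torch.randperm(len(train_set))
    for start in range(0, len(train_set) - B + 1, B):  # drop the last partial batch
        idx = perm[start:start + B].tolist()
        x = torch.stack([train_set[i][0] for i in idx])
        y = torch.tensor([train_set[i][1] for i in idx])
        opt.zero_grad()
        loss_fn(model(x), y).backward()
        opt.step()
```

In this sketch the only difference between the SS and RR runs is the single line selecting the permutation each epoch, which mirrors how the paper isolates the effect of shuffling order when combined with batch normalization.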