Stabilizing Transformer Training by Preventing Attention Entropy Collapse
Authors: Shuangfei Zhai, Tatiana Likhomanenko, Etai Littwin, Dan Busbridge, Jason Ramapuram, Yizhe Zhang, Jiatao Gu, Joshua M. Susskind
ICML 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct experiments with σReparam on image classification, image self-supervised learning, machine translation, speech recognition, and language modeling tasks. |
| Researcher Affiliation | Industry | *Equal contribution 1Apple. |
| Pseudocode | Yes | Algorithm 1: Pseudo code of σReparam in a PyTorch-like style. (An illustrative sketch is given below the table.) |
| Open Source Code | Yes | Code is available at https://github.com/apple/ml-sigma-reparam. |
| Open Datasets | Yes | We use the MAE ViT-B/16 recipe (see Appendix H) for these experiments, and train for a total of 100 epochs on ImageNet-1k (Deng et al., 2009; Touvron et al., 2021). and We use the standard WMT17 English-German benchmark with newstest2016 as a validation and newstest2017 as test sets. and All experiments are performed on the LibriSpeech dataset (Panayotov et al., 2015) |
| Dataset Splits | Yes | The standard LibriSpeech validation sets (dev-clean and dev-other) are used to tune all hyperparameters, as well as to select the best models. Test sets (test-clean and test-other) are used only to report final word error rate (WER) performance without an external language model. and newstest2016 set as a validation set and newstest2017 as a test set for final evaluation purposes only. |
| Hardware Specification | Yes | All models are trained on 8 A100 (80GB) GPUs with mixed precision computations and dynamic batching, resulting in a total batch size of 524,288 tokens. and train with tensor cores fp32 on 8 Ampere A100 (40GB) GPUs |
| Software Dependencies | No | The paper mentions software like PyTorch, Fairseq, and CuPy, but does not provide specific version numbers for these or other key software dependencies. |
| Experiment Setup | Yes | We use the MAE ViT-B/16 recipe (see Appendix H) for these experiments, and train for a total of 100 epochs on ImageNet-1k. To simplify the analysis, we only use ImageNet-1k training augmentations, and use no learning rate decay schedule (i.e. the learning rate is flat after warmup). and Table 13: Training hyperparameters comparison for supervised ViT-B/16. and Table 6: Default hyperparameters of the variants of SimCLR used in our stability analysis. |
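
The Pseudocode row above refers to Algorithm 1 of the paper, which gives σReparam in a PyTorch-like style. As a rough, unofficial sketch of the idea, the snippet below reparameterizes a linear layer's weight as Ŵ = (γ / σ(W)) · W, where σ(W) is the spectral norm estimated by power iteration and γ is a learnable scalar. The class name, initialization of γ to 1, and single power-iteration step per forward pass are assumptions made here for illustration; the authors' reference implementation lives in the repository linked above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SigmaReparamLinear(nn.Module):
    """Illustrative sketch (not the authors' code) of a sigma-Reparam linear layer.

    The weight is reparameterized as W_hat = (gamma / sigma(W)) * W, where
    sigma(W) is a power-iteration estimate of the spectral norm and gamma is a
    learnable scalar (initialized to 1 here, an assumption of this sketch).
    """

    def __init__(self, in_features: int, out_features: int, bias: bool = True):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(out_features, in_features))
        nn.init.xavier_uniform_(self.weight)
        self.bias = nn.Parameter(torch.zeros(out_features)) if bias else None
        self.gamma = nn.Parameter(torch.ones(1))
        # Power-iteration state approximating the leading singular vectors of W.
        self.register_buffer("u", F.normalize(torch.randn(out_features), dim=0))
        self.register_buffer("v", F.normalize(torch.randn(in_features), dim=0))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if self.training:
            with torch.no_grad():
                # One power-iteration step per forward pass to track sigma(W).
                self.v = F.normalize(self.weight.t() @ self.u, dim=0)
                self.u = F.normalize(self.weight @ self.v, dim=0)
        # Spectral-norm estimate; gradients flow through the weight only.
        sigma = torch.dot(self.u, self.weight @ self.v)
        w_hat = (self.gamma / sigma) * self.weight
        return F.linear(x, w_hat, self.bias)
```

In the paper, σReparam is applied to the linear layers of the transformer (e.g. attention projections and MLP weights), so in a sketch like this each `nn.Linear` would be swapped for the reparameterized layer.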