Stabilizing Transformer Training by Preventing Attention Entropy Collapse
Authors: Shuangfei Zhai, Tatiana Likhomanenko, Etai Littwin, Dan Busbridge, Jason Ramapuram, Yizhe Zhang, Jiatao Gu, Joshua M. Susskind
ICML 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct experiments with σReparam on image classification, image self-supervised learning, machine translation, speech recognition, and language modeling tasks. |
| Researcher Affiliation | Industry | *Equal contribution 1Apple. |
| Pseudocode | Yes | Algorithm 1: Pseudo code of σReparam in a PyTorch-like style. (An illustrative sketch is given below the table.) |
| Open Source Code | Yes | Code is available at https://github.com/apple/ml-sigma-reparam. |
| Open Datasets | Yes | We use the MAE ViT-B/16 recipe (see Appendix H) for these experiments, and train for a total of 100 epochs on ImageNet-1k (Deng et al., 2009; Touvron et al., 2021). and We use the standard WMT17 English-German benchmark with newstest2016 as a validation and newstest2017 as test sets. and All experiments are performed on the LibriSpeech dataset (Panayotov et al., 2015) |
| Dataset Splits | Yes | The standard LibriSpeech validation sets (dev-clean and dev-other) are used to tune all hyperparameters, as well as to select the best models. Test sets (test-clean and test-other) are used only to report final word error rate (WER) performance without an external language model. and newstest2016 set as a validation set and newstest2017 as a test set for final evaluation purposes only. |
| Hardware Specification | Yes | All models are trained on 8 A100 (80GB) GPUs with mixed precision computations and dynamic batching, resulting in a total batch size of 524,288 tokens. and train with tensor cores fp32 on 8 Ampere A100 (40GB) GPUs |
| Software Dependencies | No | The paper mentions software like PyTorch, Fairseq, and CuPy, but does not provide specific version numbers for these or other key software dependencies. |
| Experiment Setup | Yes | We use the MAE ViT-B/16 recipe (see Appendix H) for these experiments, and train for a total of 100 epochs on ImageNet-1k. To simplify the analysis, we only use ImageNet-1k training augmentations, and use no learning rate decay schedule (i.e. the learning rate is flat after warmup). and Table 13: Training hyperparameters comparison for supervised ViT-B/16. and Table 6: Default hyperparameters of the variants of SimCLR used in our stability analysis. |
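
The Pseudocode row above refers to Algorithm 1 of the paper, which gives σReparam in a PyTorch-like style. As a rough, unofficial sketch of the idea, the snippet below reparameterizes a linear layer's weight as Ŵ = (γ / σ(W)) · W, where σ(W) is the spectral norm estimated by power iteration and γ is a learnable scalar. The class name, initialization of γ to 1, and single power-iteration step per forward pass are assumptions made here for illustration; the authors' reference implementation lives in the repository linked above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SigmaReparamLinear(nn.Module):
    """Illustrative sketch (not the authors' code) of a sigma-Reparam linear layer.

    The weight is reparameterized as W_hat = (gamma / sigma(W)) * W, where
    sigma(W) is a power-iteration estimate of the spectral norm and gamma is a
    learnable scalar (initialized to 1 here, an assumption of this sketch).
    """

    def __init__(self, in_features: int, out_features: int, bias: bool = True):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(out_features, in_features))
        nn.init.xavier_uniform_(self.weight)
        self.bias = nn.Parameter(torch.zeros(out_features)) if bias else None
        self.gamma = nn.Parameter(torch.ones(1))
        # Power-iteration state approximating the leading singular vectors of W.
        self.register_buffer("u", F.normalize(torch.randn(out_features), dim=0))
        self.register_buffer("v", F.normalize(torch.randn(in_features), dim=0))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if self.training:
            with torch.no_grad():
                # One power-iteration step per forward pass to track sigma(W).
                self.v = F.normalize(self.weight.t() @ self.u, dim=0)
                self.u = F.normalize(self.weight @ self.v, dim=0)
        # Spectral-norm estimate; gradients flow through the weight only.
        sigma = torch.dot(self.u, self.weight @ self.v)
        w_hat = (self.gamma / sigma) * self.weight
        return F.linear(x, w_hat, self.bias)
```

In the paper, σReparam is applied to the linear layers of the transformer (e.g. attention projections and MLP weights), so in a sketch like this each `nn.Linear` would be swapped for the reparameterized layer.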