Eureka-Moments in Transformers: Multi-Step Tasks Reveal Softmax Induced Optimization Problems
Authors: David T Hoffmann, Simon Schrodi, Jelena Bratulić, Nadine Behrmann, Volker Fischer, Thomas Brox
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this work, we study rapid improvements of the training loss in transformers when they are confronted with multi-step decision tasks. We find that transformers struggle to learn the intermediate task, and both training and validation loss saturate for hundreds of epochs. When transformers finally learn the intermediate task, they do so rapidly and unexpectedly. We call these abrupt improvements Eureka-moments, since the transformer appears to suddenly learn a previously incomprehensible concept. We designed synthetic tasks to study the problem in detail, but the leaps in performance can also be observed for language modeling and in-context learning (ICL). Our study reveals that transformers have difficulties in learning such two-step tasks (Fig. 1b). In summary, the contributions of this analysis paper are: 1) We study multi-step learning without intermediate supervision via a fully controlled data-generating process on synthetic tasks. ... 4) To validate the role of Softmax, we mitigate the failure mode through targeted interventions. We show that these interventions lead to significantly faster convergence, higher accuracy, higher robustness to suboptimal hyper-parameters, and a higher probability of model convergence, affirming our analysis. (The softmax mechanism at issue is sketched after the table.) |
| Researcher Affiliation | Collaboration | ¹University of Freiburg, ²Bosch Center for AI, ³Amazon (work done while at Bosch). Correspondence to: David T. Hoffmann <hoffmann@cs.uni-freiburg.de>. |
| Pseudocode | No | No explicit pseudocode or algorithm blocks were found. |
| Open Source Code | Yes | The code to reproduce the results and create the datasets is available at https://github.com/boschresearch/eurekaMoments. |
| Open Datasets | Yes | Vision dataset creation. The visual datasets are based on MNIST (LeCun et al., 2010) and Fashion-MNIST (Xiao et al., 2017). ImageNet-100 (Tian et al., 2020). |
| Dataset Splits | No | No explicit statement of dataset split ratios (e.g., '80% training, 10% validation, 10% test') or absolute validation sample counts was found for the datasets used. |
| Hardware Specification | Yes | We train all models with a batch size of 512, which fits on a single V100, for all the architectures that we considered. |
| Software Dependencies | No | No specific version numbers for software dependencies (e.g., Python, PyTorch, TensorFlow, etc.) were found. |
| Experiment Setup | Yes | Unless stated otherwise, we train a ViT with 7 layers, 4 heads each, an embedding dimension of 64, a patch size of 4, and an MLP-ratio of 2. Consequently, the default temperature is √d_k = 8. ... For optimization we use AdamW (Loshchilov & Hutter, 2019) with default values, i.e., β1 = 0.9, β2 = 0.999, and ϵ = 10^-8. Unless otherwise stated, we warm up the learning rate for 5 epochs from 10^-6 to the maximum learning rate, use a weight decay of 0.05, and train for 300 epochs. We train all models with a batch size of 512, which fits on a single V100 for all the architectures that we considered. (A configuration sketch follows the table.) |
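
To make the Experiment Setup row concrete, here is a minimal sketch of that configuration in PyTorch. It assumes `timm`'s `VisionTransformer`; the library choice, the input size, and the value of `max_lr` are assumptions, not details from the paper (the maximum learning rate is one of the swept hyper-parameters).

```python
# Minimal sketch of the reported training setup; not the authors' code.
# Assumptions are marked in comments.
import torch
from timm.models.vision_transformer import VisionTransformer

model = VisionTransformer(
    img_size=28,    # assumption: MNIST-sized inputs for the synthetic vision tasks
    in_chans=1,     # assumption: grayscale MNIST / Fashion-MNIST images
    patch_size=4,   # patch size of 4, as reported
    embed_dim=64,   # embedding dimension of 64
    depth=7,        # 7 layers
    num_heads=4,    # 4 heads
    mlp_ratio=2,    # MLP-ratio of 2
)

max_lr = 1e-3  # hypothetical; the paper only states warmup from 1e-6 to the maximum LR
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=max_lr,
    betas=(0.9, 0.999),  # AdamW defaults, as reported
    eps=1e-8,
    weight_decay=0.05,
)
# Linear warmup from 1e-6 to max_lr over 5 epochs (scheduler stepped once per epoch).
scheduler = torch.optim.lr_scheduler.LinearLR(
    optimizer, start_factor=1e-6 / max_lr, total_iters=5
)
```

A batch size of 512 and 300 training epochs complete the reported recipe; a plain training loop that calls `scheduler.step()` at each epoch boundary reproduces the warmup.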
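
The Research Type row centers on optimization problems induced by the attention softmax, mitigated through targeted interventions. The sketch below shows single-head scaled dot-product attention with the softmax temperature exposed as an explicit parameter; it illustrates the mechanism the paper analyzes, not the authors' specific intervention. With the reported defaults, the temperature is √d_k = 8.

```python
# Illustrative single-head attention with an explicit softmax temperature tau.
# tau = sqrt(d_k) recovers standard attention (tau = 8 for d_k = 64, as above).
import math
import torch

def attention(q, k, v, tau=None):
    """q, k, v: (..., seq_len, d_k) tensors; tau: softmax temperature."""
    d_k = q.size(-1)
    if tau is None:
        tau = math.sqrt(d_k)                  # standard scaling
    scores = q @ k.transpose(-2, -1) / tau    # lower tau -> sharper attention weights
    return scores.softmax(dim=-1) @ v
```

When the pre-softmax logits are large relative to tau, the softmax saturates and the gradients flowing through it become very small, which is the kind of softmax-induced optimization problem named in the paper's title; adjusting the temperature is one intervention of the type the authors validate.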