Eureka-Moments in Transformers: Multi-Step Tasks Reveal Softmax Induced Optimization Problems
Authors: David T Hoffmann, Simon Schrodi, Jelena Bratulić, Nadine Behrmann, Volker Fischer, Thomas Brox
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this work, we study rapid improvements of the training loss in transformers when they are confronted with multi-step decision tasks. We find that transformers struggle to learn the intermediate task, and both training and validation loss saturate for hundreds of epochs. When transformers finally learn the intermediate task, they do so rapidly and unexpectedly. We call these abrupt improvements Eureka-moments, since the transformer appears to suddenly learn a previously incomprehensible concept. We designed synthetic tasks to study the problem in detail, but the leaps in performance can also be observed for language modeling and in-context learning (ICL). Our study reveals that transformers have difficulties in learning such two-step tasks (Fig. 1b). In summary, the contributions of this analysis paper are: 1) We study multi-step learning without intermediate supervision via a fully controlled data-generating process on synthetic tasks. ... 4) To validate the role of Softmax, we mitigate the failure mode through targeted interventions. We show that these interventions lead to significantly faster convergence, higher accuracy, higher robustness to suboptimal hyper-parameters, and a higher probability of model convergence, affirming our analysis. (The softmax mechanism at issue is sketched after the table.) |
| Researcher Affiliation | Collaboration | ¹University of Freiburg, ²Bosch Center for AI, ³Amazon (work done while at Bosch). Correspondence to: David T. Hoffmann <hoffmann@cs.uni-freiburg.de>. |
| Pseudocode | No | No explicit pseudocode or algorithm blocks were found. |
| Open Source Code | Yes | The code to reproduce the results and create the datasets is available at https://github.com/boschresearch/eurekaMoments. |
| Open Datasets | Yes | Vision dataset creation. The visual datasets are based on MNIST (LeCun et al., 2010) and Fashion-MNIST (Xiao et al., 2017). ImageNet-100 (Tian et al., 2020). |
| Dataset Splits | No | No explicit statement of dataset split ratios (e.g., '80% training, 10% validation, 10% test') or absolute validation sample counts was found for the datasets used. |
| Hardware Specification | Yes | We train all models with a batch size of 512, which fits on a single V100, for all the architectures that we considered. |
| Software Dependencies | No | No specific version numbers for software dependencies (e.g., Python, PyTorch, TensorFlow, etc.) were found. |
| Experiment Setup | Yes | Unless stated otherwise, we train a ViT with 7 layers, 4 heads each, an embedding dimension of 64, a patch size of 4, and an MLP-ratio of 2. Consequently, the default temperature is √d_k = 8. ... For optimization we use AdamW (Loshchilov & Hutter, 2019) with default values, i.e., β1 = 0.9, β2 = 0.999, and ϵ = 10^-8. Unless otherwise stated, we warm up the learning rate for 5 epochs from 10^-6 to the maximum learning rate, use a weight decay of 0.05, and train for 300 epochs. We train all models with a batch size of 512, which fits on a single V100 for all the architectures that we considered. (A configuration sketch follows the table.) |
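
To make the Experiment Setup row concrete, here is a minimal sketch of that configuration in PyTorch. It assumes `timm`'s `VisionTransformer`; the library choice, the input size, and the value of `max_lr` are assumptions, not details from the paper (the maximum learning rate is one of the swept hyper-parameters).

```python
# Minimal sketch of the reported training setup; not the authors' code.
# Assumptions are marked in comments.
import torch
from timm.models.vision_transformer import VisionTransformer

model = VisionTransformer(
    img_size=28,    # assumption: MNIST-sized inputs for the synthetic vision tasks
    in_chans=1,     # assumption: grayscale MNIST / Fashion-MNIST images
    patch_size=4,   # patch size of 4, as reported
    embed_dim=64,   # embedding dimension of 64
    depth=7,        # 7 layers
    num_heads=4,    # 4 heads
    mlp_ratio=2,    # MLP-ratio of 2
)

max_lr = 1e-3  # hypothetical; the paper only states warmup from 1e-6 to the maximum LR
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=max_lr,
    betas=(0.9, 0.999),  # AdamW defaults, as reported
    eps=1e-8,
    weight_decay=0.05,
)
# Linear warmup from 1e-6 to max_lr over 5 epochs (scheduler stepped once per epoch).
scheduler = torch.optim.lr_scheduler.LinearLR(
    optimizer, start_factor=1e-6 / max_lr, total_iters=5
)
```

A batch size of 512 and 300 training epochs complete the reported recipe; a plain training loop that calls `scheduler.step()` at each epoch boundary reproduces the warmup.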
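
The Research Type row centers on optimization problems induced by the attention softmax, mitigated through targeted interventions. The sketch below shows single-head scaled dot-product attention with the softmax temperature exposed as an explicit parameter; it illustrates the mechanism the paper analyzes, not the authors' specific intervention. With the reported defaults, the temperature is √d_k = 8.

```python
# Illustrative single-head attention with an explicit softmax temperature tau.
# tau = sqrt(d_k) recovers standard attention (tau = 8 for d_k = 64, as above).
import math
import torch

def attention(q, k, v, tau=None):
    """q, k, v: (..., seq_len, d_k) tensors; tau: softmax temperature."""
    d_k = q.size(-1)
    if tau is None:
        tau = math.sqrt(d_k)                  # standard scaling
    scores = q @ k.transpose(-2, -1) / tau    # lower tau -> sharper attention weights
    return scores.softmax(dim=-1) @ v
```

When the pre-softmax logits are large relative to tau, the softmax saturates and the gradients flowing through it become very small, which is the kind of softmax-induced optimization problem named in the paper's title; adjusting the temperature is one intervention of the type the authors validate.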