Unraveling the Gradient Descent Dynamics of Transformers
Authors: Bingqing Song, Boran Han, Shuai Zhang, Jie Ding, Mingyi Hong
NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section, we present numerical results to illustrate the behaviors of Transformer models with Softmax attention and Gaussian kernel attention across various tasks. (See the attention sketch after the table.) |
| Researcher Affiliation | Collaboration | Bingqing Song (University of Minnesota, Twin Cities, song0409@umn.edu); Boran Han (Amazon Web Services, boranhan@amazon.com); Shuai Zhang (Amazon Web Services, shuaizs@amazon.com); Jie Ding (University of Minnesota, Twin Cities, dingj@umn.edu); Mingyi Hong (University of Minnesota, Twin Cities, mhong@umn.edu) |
| Pseudocode | No | The paper does not contain any pseudocode or clearly labeled algorithm blocks. |
| Open Source Code | No | We do not include open access to the code. |
| Open Datasets | Yes | We investigate two distinct tasks: Text Classification using the IMDb review dataset [Maas et al., 2011] and Pathfinder [Linsley et al., 2018]. |
| Dataset Splits | No | The paper describes training and testing but does not explicitly detail the split percentages or methodology for train/validation/test sets. It mentions 'test accuracy and test loss within the training steps' but not how the data was partitioned for these stages. |
| Hardware Specification | No | We do not include details of the compute resources. |
| Software Dependencies | No | For optimization, we use Stochastic Gradient Descent (SGD) for the Text Classification task and Adam for the Pathfinder task. The paper does not specify version numbers for any software libraries or dependencies. |
| Experiment Setup | Yes | For both tasks, we employ a 2-layer Transformer model with the following specifications: embedding dimension D = 64, hidden dimension d = 128, and number of attention heads H = 2. [...] we set a batch size of 16 for the Text Classification task with a learning rate of 1 × 10⁻⁴, and a batch size of 128 for the Pathfinder task with a learning rate of 2 × 10⁻⁴. For optimization, we use Stochastic Gradient Descent (SGD) for the Text Classification task and Adam for the Pathfinder task. |
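
The Softmax and Gaussian kernel attention variants compared in the paper can be illustrated with a minimal sketch. The PyTorch snippet below is an assumption-laden illustration rather than the authors' implementation: the function names, the bandwidth `sigma`, and the row normalization of the Gaussian weights are choices made here for clarity.

```python
import torch
import torch.nn.functional as F

def softmax_attention(q, k, v):
    # Standard scaled dot-product attention: softmax(QK^T / sqrt(d)) V.
    d = q.shape[-1]
    scores = q @ k.transpose(-2, -1) / d ** 0.5
    return F.softmax(scores, dim=-1) @ v

def gaussian_kernel_attention(q, k, v, sigma=1.0):
    # Attention weights from a Gaussian (RBF) kernel on query-key distances,
    # exp(-||q_i - k_j||^2 / (2 * sigma^2)), normalized along each row.
    # The exact bandwidth and normalization used in the paper may differ.
    dist_sq = torch.cdist(q, k, p=2) ** 2                 # (..., L_q, L_k)
    weights = torch.exp(-dist_sq / (2 * sigma ** 2))
    weights = weights / weights.sum(dim=-1, keepdim=True)
    return weights @ v

# Usage example with the reported embedding dimension D = 64.
q = k = v = torch.randn(2, 5, 64)                          # (batch, sequence, D)
out_softmax = softmax_attention(q, k, v)
out_gaussian = gaussian_kernel_attention(q, k, v)
```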
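
The reported experiment setup (2 layers, D = 64, d = 128, H = 2, SGD at 1 × 10⁻⁴ with batch size 16 for Text Classification, Adam at 2 × 10⁻⁴ with batch size 128 for Pathfinder) can be sketched with standard PyTorch modules. This is a minimal configuration sketch, not the authors' code: the vocabulary size, number of classes, mean pooling, and classification head are placeholders not specified in the excerpt.

```python
import torch
from torch import nn

class TwoLayerTransformerClassifier(nn.Module):
    # vocab_size and num_classes are hypothetical placeholders.
    def __init__(self, vocab_size=30000, num_classes=2,
                 d_model=64, n_heads=2, d_ff=128, n_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads,
            dim_feedforward=d_ff, batch_first=True,
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Linear(d_model, num_classes)

    def forward(self, tokens):
        h = self.encoder(self.embed(tokens))
        return self.head(h.mean(dim=1))   # mean-pool over the sequence

model = TwoLayerTransformerClassifier()

# Reported optimizer settings: SGD (lr = 1e-4, batch size 16) for Text
# Classification, Adam (lr = 2e-4, batch size 128) for Pathfinder.
sgd_optimizer = torch.optim.SGD(model.parameters(), lr=1e-4)
adam_optimizer = torch.optim.Adam(model.parameters(), lr=2e-4)
```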