Unraveling the Gradient Descent Dynamics of Transformers
Authors: Bingqing Song, Boran Han, Shuai Zhang, Jie Ding, Mingyi Hong
NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section, we present numerical results to illustrate the behaviors of Transformer models with Softmax attention and Gaussian kernel attention across various tasks. (See the attention sketch after the table.) |
| Researcher Affiliation | Collaboration | Bingqing Song (University of Minnesota, Twin Cities, song0409@umn.edu); Boran Han (Amazon Web Services, boranhan@amazon.com); Shuai Zhang (Amazon Web Services, shuaizs@amazon.com); Jie Ding (University of Minnesota, Twin Cities, dingj@umn.edu); Mingyi Hong (University of Minnesota, Twin Cities, mhong@umn.edu) |
| Pseudocode | No | The paper does not contain any pseudocode or clearly labeled algorithm blocks. |
| Open Source Code | No | We do not include open access to the code. |
| Open Datasets | Yes | We investigate two distinct tasks: Text Classification using the IMDb review dataset [Maas et al., 2011] and Pathfinder [Linsley et al., 2018]. |
| Dataset Splits | No | The paper describes training and testing but does not explicitly detail the split percentages or methodology for train/validation/test sets. It mentions 'test accuracy and test loss within the training steps' but not how the data was partitioned for these stages. |
| Hardware Specification | No | We do not include details of the compute resources. |
| Software Dependencies | No | For optimization, we use Stochastic Gradient Descent (SGD) for the Text Classification task and Adam for the Pathfinder task. The paper does not specify version numbers for any software libraries or dependencies. |
| Experiment Setup | Yes | For both tasks, we employ a 2-layer Transformer model with the following specifications: embedding dimension D = 64, hidden dimension d = 128, and number of attention heads H = 2. [...] we set a batch size of 16 for the Text Classification task with a learning rate of 1 × 10⁻⁴, and a batch size of 128 for the Pathfinder task with a learning rate of 2 × 10⁻⁴. For optimization, we use Stochastic Gradient Descent (SGD) for the Text Classification task and Adam for the Pathfinder task. |
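
The Softmax and Gaussian kernel attention variants compared in the paper can be illustrated with a minimal sketch. The PyTorch snippet below is an assumption-laden illustration rather than the authors' implementation: the function names, the bandwidth `sigma`, and the row normalization of the Gaussian weights are choices made here for clarity.

```python
import torch
import torch.nn.functional as F

def softmax_attention(q, k, v):
    # Standard scaled dot-product attention: softmax(QK^T / sqrt(d)) V.
    d = q.shape[-1]
    scores = q @ k.transpose(-2, -1) / d ** 0.5
    return F.softmax(scores, dim=-1) @ v

def gaussian_kernel_attention(q, k, v, sigma=1.0):
    # Attention weights from a Gaussian (RBF) kernel on query-key distances,
    # exp(-||q_i - k_j||^2 / (2 * sigma^2)), normalized along each row.
    # The exact bandwidth and normalization used in the paper may differ.
    dist_sq = torch.cdist(q, k, p=2) ** 2                 # (..., L_q, L_k)
    weights = torch.exp(-dist_sq / (2 * sigma ** 2))
    weights = weights / weights.sum(dim=-1, keepdim=True)
    return weights @ v

# Usage example with the reported embedding dimension D = 64.
q = k = v = torch.randn(2, 5, 64)                          # (batch, sequence, D)
out_softmax = softmax_attention(q, k, v)
out_gaussian = gaussian_kernel_attention(q, k, v)
```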
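
The reported experiment setup (2 layers, D = 64, d = 128, H = 2, SGD at 1 × 10⁻⁴ with batch size 16 for Text Classification, Adam at 2 × 10⁻⁴ with batch size 128 for Pathfinder) can be sketched with standard PyTorch modules. This is a minimal configuration sketch, not the authors' code: the vocabulary size, number of classes, mean pooling, and classification head are placeholders not specified in the excerpt.

```python
import torch
from torch import nn

class TwoLayerTransformerClassifier(nn.Module):
    # vocab_size and num_classes are hypothetical placeholders.
    def __init__(self, vocab_size=30000, num_classes=2,
                 d_model=64, n_heads=2, d_ff=128, n_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads,
            dim_feedforward=d_ff, batch_first=True,
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Linear(d_model, num_classes)

    def forward(self, tokens):
        h = self.encoder(self.embed(tokens))
        return self.head(h.mean(dim=1))   # mean-pool over the sequence

model = TwoLayerTransformerClassifier()

# Reported optimizer settings: SGD (lr = 1e-4, batch size 16) for Text
# Classification, Adam (lr = 2e-4, batch size 128) for Pathfinder.
sgd_optimizer = torch.optim.SGD(model.parameters(), lr=1e-4)
adam_optimizer = torch.optim.Adam(model.parameters(), lr=2e-4)
```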