On the Dynamics of Training Attention Models
Authors: Haoye Lu, Yongyi Mao, Amiya Nayak
ICLR 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments are performed, which validate our theoretical analysis and provide further insights. |
| Researcher Affiliation | Academia | School of Electrical Engineering and Computer Science, University of Ottawa, Ottawa, K1N 6N5, Canada |
| Pseudocode | No | The paper contains mathematical derivations and lemmas but no explicitly labeled 'Pseudocode' or 'Algorithm' blocks. |
| Open Source Code | No | The paper does not contain any explicit statements about releasing source code or provide links to a code repository. |
| Open Datasets | Yes | Our theoretical results and their implications are corroborated by experiments performed on a synthetic dataset and real-world datasets. Additional insights are also obtained from these experiments. For example, low-capacity classifiers tend to give stronger training signals to the attention module. The mutual promotion effect implied by the discovered relationship can also exhibit itself as mutual suppression in the early training phase. Furthermore, in the real-world datasets, where perfect... We performed our experiments on three models, Attn-FC, Attn-TC and Attn-TL, having the same attention block but different classifiers. The first two have the classifier in the form c(v(χ)) = softmax(U^T v(χ)) and the last in the form c(v(χ)) = softmax(U_2^T ReLU(U_1^T v(χ) + b_1) + b_2). Except that the U in Attn-FC is fixed, the other parameters of the three models are trainable and optimized using the cross-entropy loss. The second part of the experiment is performed on the datasets SST2 and SST5, which contain movie comments and ratings (positive or negative in SST2 and one to five stars in SST5). For simplicity, we limit our discussion to Attn-FC and Attn-TC using the same configurations as our previous experiments, except that the embedding dimension is set to 200. Remark that our goal is not to find a state-of-the-art algorithm but to verify our theoretical results and further investigate how an attention-based network works. For both SST2 and SST5, we trained the two models by gradient descent with learning rate η = 0.1 combined with the early-stopping technique (Prechelt, 2012) with patience 100. As PyTorch requires equal-length sentences in a batch, we pad all the sentences to the same length and set the score of the padding symbol to negative infinity. Under this configuration, the trained Attn-FC and Attn-TC reached 76.68% and 79.59% test accuracy on SST2 and 38.49% and 40.53% on SST5. (A hypothetical PyTorch sketch of these models is given below the table.) |
| Dataset Splits | Yes | The artificial dataset, consisting of 800 training and 200 test samples, is generated through the procedure introduced in Section 3. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used for running experiments, such as GPU or CPU models. |
| Software Dependencies | No | The paper mentions using 'PyTorch (Paszke et al., 2017)' but does not specify a version number for this or any other key software dependency. |
| Experiment Setup | Yes | We trained the models using gradient descent with learning rate η = 0.1 for 5K epochs before measuring their prediction accuracy on the test samples. When training is completed, all three models achieve a training loss close to zero and 100.0% test accuracy, which implies the trained models perfectly explain the training set's variations and generalize well to the test set. Unless otherwise stated, the scores are set to zero and the embeddings are initialized by a normal distribution with mean zero and variance σ_d^2 = 10^-6. (A matching training-loop sketch is given below the table.) |
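
The classifier forms quoted in the "Open Datasets" row translate directly into a small PyTorch module. The sketch below is a hypothetical reconstruction, not the authors' released code: the class name `AttnClassifier`, the tensor shapes, and the use of `nn.Embedding` for both token embeddings and attention scores are assumptions. Only the classifier forms softmax(U^T v(χ)) and softmax(U_2^T ReLU(U_1^T v(χ) + b_1) + b_2), the fixed U in Attn-FC, the zero-initialized scores, the small-variance embedding initialization, and the -inf score for padding symbols are taken from the quoted text.

```python
# Hypothetical PyTorch sketch of the Attn-FC / Attn-TC / Attn-TL models described above.
# Only the classifier forms, the fixed U in Attn-FC, the initializations, and the
# -inf padding score come from the quoted text; everything else is an assumption.
import torch
import torch.nn as nn


class AttnClassifier(nn.Module):
    def __init__(self, vocab_size, embed_dim, num_classes,
                 two_layer=False, freeze_u=False, pad_id=0):
        super().__init__()
        self.pad_id = pad_id
        # Trainable token embeddings and per-token attention scores.
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.score = nn.Embedding(vocab_size, 1)
        nn.init.normal_(self.embed.weight, mean=0.0, std=1e-3)  # variance sigma_d^2 = 1e-6
        nn.init.zeros_(self.score.weight)                       # scores start at zero

        if two_layer:
            # Attn-TL: c(v) = softmax(U_2^T ReLU(U_1^T v + b_1) + b_2)
            self.classifier = nn.Sequential(
                nn.Linear(embed_dim, embed_dim),
                nn.ReLU(),
                nn.Linear(embed_dim, num_classes),
            )
        else:
            # Attn-FC / Attn-TC: c(v) = softmax(U^T v); Attn-FC keeps U fixed.
            linear = nn.Linear(embed_dim, num_classes, bias=False)
            if freeze_u:
                linear.weight.requires_grad_(False)
            self.classifier = linear

    def forward(self, tokens):
        # tokens: (batch, seq_len) integer ids, padded with self.pad_id.
        scores = self.score(tokens).squeeze(-1)                          # (batch, seq_len)
        scores = scores.masked_fill(tokens == self.pad_id, float("-inf"))  # padding gets score -inf
        alpha = torch.softmax(scores, dim=-1)                            # attention weights
        pooled = torch.einsum("bs,bsd->bd", alpha, self.embed(tokens))   # weighted sum v(chi)
        return self.classifier(pooled)  # logits; softmax is folded into the cross-entropy loss
```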
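
Similarly, the quoted experiment setup (full-batch gradient descent with η = 0.1, cross-entropy loss, 5K epochs on the synthetic data) can be sketched as below. The random tensors, vocabulary size, sequence length, and embedding dimension are placeholders standing in for the 800-sample synthetic training set, and the early-stopping logic used for SST2/SST5 is omitted; the sketch reuses the `AttnClassifier` class from the sketch above.

```python
# Hypothetical training sketch matching the quoted setup: gradient descent with
# learning rate 0.1, cross-entropy loss, 5K epochs. The random data below merely
# stands in for the 800-sample synthetic training set; it is not the paper's data.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
train_tokens = torch.randint(1, 1000, (800, 20))  # (samples, seq_len); id 0 reserved for padding
train_labels = torch.randint(0, 2, (800,))

model = AttnClassifier(vocab_size=1000, embed_dim=200, num_classes=2)  # class from the sketch above
optimizer = torch.optim.SGD(
    (p for p in model.parameters() if p.requires_grad), lr=0.1
)

for epoch in range(5000):
    optimizer.zero_grad()
    loss = F.cross_entropy(model(train_tokens), train_labels)
    loss.backward()
    optimizer.step()

with torch.no_grad():
    train_acc = (model(train_tokens).argmax(dim=-1) == train_labels).float().mean()
```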