On the Dynamics of Training Attention Models
Authors: Haoye Lu, Yongyi Mao, Amiya Nayak
ICLR 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments are performed, which validate our theoretical analysis and provide further insights. |
| Researcher Affiliation | Academia | School of Electrical Engineering and Computer Science, University of Ottawa, Ottawa, K1N 6N5, Canada |
| Pseudocode | No | The paper contains mathematical derivations and lemmas but no explicitly labeled 'Pseudocode' or 'Algorithm' blocks. |
| Open Source Code | No | The paper does not contain any explicit statements about releasing source code or provide links to a code repository. |
| Open Datasets | Yes | Our theoretical results and their implications are corroborated by experiments performed on a synthetic dataset and real-world datasets. Additional insights are also obtained from these experiments. For example, low-capacity classifiers tend to give stronger training signals to the attention module. The mutual promotion effect implied by the discovered relationship can also exhibit itself as mutual suppression in the early training phase. Furthermore, in the real-world datasets, where perfect... We performed our experiments on three models, Attn-FC, Attn-TC and Attn-TL, having the same attention block but different classifiers. The first two have the classifier in the form c(v(χ)) = softmax(U^T v(χ)) and the last in the form c(v(χ)) = softmax(U_2^T ReLU(U_1^T v(χ) + b_1) + b_2). Except that the U in Attn-FC is fixed, the other parameters of the three models are trainable and optimized using the cross-entropy loss. The second part of the experiment is performed on the datasets SST2 and SST5, which contain movie comments and ratings (positive or negative in SST2 and one to five stars in SST5). For simplicity, we limit our discussion to Attn-FC and Attn-TC using the same configurations as our previous experiments, except that the embedding dimension is set to 200. Remark that our goal is not to find a state-of-the-art algorithm but to verify our theoretical results and further investigate how an attention-based network works. For both SST2 and SST5, we trained the two models by gradient descent with learning rate η = 0.1 combined with the early-stopping technique (Prechelt, 2012) with patience 100. As PyTorch requires equal-length sentences in a batch, we pad all the sentences to the same length and set the score of the padding symbol to negative infinity. Under this configuration, the trained Attn-FC and Attn-TC reached 76.68% and 79.59% test accuracy on SST2 and 38.49% and 40.53% on SST5. (A hypothetical PyTorch sketch of these models is given below the table.) |
| Dataset Splits | Yes | The artificial dataset, consisting of 800 training and 200 test samples, is generated through the procedure introduced in Section 3. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used for running experiments, such as GPU or CPU models. |
| Software Dependencies | No | The paper mentions using 'PyTorch (Paszke et al., 2017)' but does not specify a version number for this or any other key software dependency. |
| Experiment Setup | Yes | We trained the models using gradient descent with learning rate η = 0.1 for 5K epochs before measuring their prediction accuracy on the test samples. When training is completed, all three models achieve a training loss close to zero and 100.0% test accuracy, which implies the trained models perfectly explain the training set's variations and generalize well to the test set. Unless otherwise stated, the scores are set to zero and the embeddings are initialized by a normal distribution with mean zero and variance σ_d^2 = 10^-6. (A matching training-loop sketch is given below the table.) |
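
The classifier forms quoted in the "Open Datasets" row translate directly into a small PyTorch module. The sketch below is a hypothetical reconstruction, not the authors' released code: the class name `AttnClassifier`, the tensor shapes, and the use of `nn.Embedding` for both token embeddings and attention scores are assumptions. Only the classifier forms softmax(U^T v(χ)) and softmax(U_2^T ReLU(U_1^T v(χ) + b_1) + b_2), the fixed U in Attn-FC, the zero-initialized scores, the small-variance embedding initialization, and the -inf score for padding symbols are taken from the quoted text.

```python
# Hypothetical PyTorch sketch of the Attn-FC / Attn-TC / Attn-TL models described above.
# Only the classifier forms, the fixed U in Attn-FC, the initializations, and the
# -inf padding score come from the quoted text; everything else is an assumption.
import torch
import torch.nn as nn


class AttnClassifier(nn.Module):
    def __init__(self, vocab_size, embed_dim, num_classes,
                 two_layer=False, freeze_u=False, pad_id=0):
        super().__init__()
        self.pad_id = pad_id
        # Trainable token embeddings and per-token attention scores.
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.score = nn.Embedding(vocab_size, 1)
        nn.init.normal_(self.embed.weight, mean=0.0, std=1e-3)  # variance sigma_d^2 = 1e-6
        nn.init.zeros_(self.score.weight)                       # scores start at zero

        if two_layer:
            # Attn-TL: c(v) = softmax(U_2^T ReLU(U_1^T v + b_1) + b_2)
            self.classifier = nn.Sequential(
                nn.Linear(embed_dim, embed_dim),
                nn.ReLU(),
                nn.Linear(embed_dim, num_classes),
            )
        else:
            # Attn-FC / Attn-TC: c(v) = softmax(U^T v); Attn-FC keeps U fixed.
            linear = nn.Linear(embed_dim, num_classes, bias=False)
            if freeze_u:
                linear.weight.requires_grad_(False)
            self.classifier = linear

    def forward(self, tokens):
        # tokens: (batch, seq_len) integer ids, padded with self.pad_id.
        scores = self.score(tokens).squeeze(-1)                          # (batch, seq_len)
        scores = scores.masked_fill(tokens == self.pad_id, float("-inf"))  # padding gets score -inf
        alpha = torch.softmax(scores, dim=-1)                            # attention weights
        pooled = torch.einsum("bs,bsd->bd", alpha, self.embed(tokens))   # weighted sum v(chi)
        return self.classifier(pooled)  # logits; softmax is folded into the cross-entropy loss
```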
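
Similarly, the quoted experiment setup (full-batch gradient descent with η = 0.1, cross-entropy loss, 5K epochs on the synthetic data) can be sketched as below. The random tensors, vocabulary size, sequence length, and embedding dimension are placeholders standing in for the 800-sample synthetic training set, and the early-stopping logic used for SST2/SST5 is omitted; the sketch reuses the `AttnClassifier` class from the sketch above.

```python
# Hypothetical training sketch matching the quoted setup: gradient descent with
# learning rate 0.1, cross-entropy loss, 5K epochs. The random data below merely
# stands in for the 800-sample synthetic training set; it is not the paper's data.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
train_tokens = torch.randint(1, 1000, (800, 20))  # (samples, seq_len); id 0 reserved for padding
train_labels = torch.randint(0, 2, (800,))

model = AttnClassifier(vocab_size=1000, embed_dim=200, num_classes=2)  # class from the sketch above
optimizer = torch.optim.SGD(
    (p for p in model.parameters() if p.requires_grad), lr=0.1
)

for epoch in range(5000):
    optimizer.zero_grad()
    loss = F.cross_entropy(model(train_tokens), train_labels)
    loss.backward()
    optimizer.step()

with torch.no_grad():
    train_acc = (model(train_tokens).argmax(dim=-1) == train_labels).float().mean()
```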