Untangling tradeoffs between recurrence and self-attention in artificial neural networks
Authors: Giancarlo Kerg, Bhargav Kanuparthi, Anirudh Goyal, Kyle Goyette, Yoshua Bengio, Guillaume Lajoie
NeurIPS 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Using simple tasks for their ease of interpretation, and their variety of computational demands, we illustrate the efficacy of this approach in numerical experiments. The remainder of this paper is as follows. In Section 2, we give a brief outline of related cognitive processes and neural network mechanisms. In Section 3, we present our central results: asymptotic guarantees for gradient propagation in self-attentive recurrent networks. To illustrate how to exploit these guarantees, in Section 4, we showcase a simple relevancy screening mechanism that aims to efficiently consolidate relevant memory, reducing the size of the computational graph from quadratic to linear in sequence length. Finally, in Section 5, we compare various recurrent and attention models with our proposed relevancy screening mechanism on a series of simple numerical experiments, while, in Section 6, we analyze their gradient propagation properties together with their GPU usage. |
| Researcher Affiliation | Academia | 1: Mila Quebec AI Institute, Canada 2: Université de Montréal, Département d'Informatique et de Recherche Opérationnelle, Montreal, Canada 3: Université de Montréal, CIRRELT, Montreal, Canada 4: CIFAR Senior Fellow 5: Université de Montréal, Département de Mathématiques et Statistiques, Montreal, Canada |
| Pseudocode | Yes | Algorithm 1 Relevancy Screening |
| Open Source Code | No | The paper does not provide any explicit statements about releasing source code or links to a code repository for the described methodology. |
| Open Datasets | Yes | Copy task [19]: The characters to be copied are presented in the first 10 time steps, then must be outputted after a long delay of T time steps (see full description in Arjovsky et al. [2]). Denoise task (Jing et al. [21]): This generalizes the Copy task, as the symbols that need to be copied are now randomly distributed among the T time steps, requiring the model to selectively pick the inputs that need to be copied. Here, we perform tests on pMNIST [24], a variant of MNIST [25] where pixels are fed sequentially in a permuted order to the network, as well as the character-level Penn Treebank corpus (PTB) [27], where the next letter in a text needs to be predicted. (A sketch of synthetic generators for the Copy and Denoise tasks follows the table.) |
| Dataset Splits | No | The paper does not explicitly provide specific percentages, counts, or a detailed methodology for splitting datasets into training, validation, and test sets. It mentions training on a specific task and then evaluating on variations (e.g., Transfer Copy task trained on T=100 and evaluated for T > 100), but not general data splits for model training. |
| Hardware Specification | Yes | All the models were run using a NVIDIA Titan XP GPU and their peak usage was recorded in order to quantify the amount of computational resources used for each of them. |
| Software Dependencies | No | The paper does not provide specific software dependencies with version numbers (e.g., Python 3.x, PyTorch 1.x). It mentions external models or frameworks in citations, but not its own software environment for reproducibility. |
| Experiment Setup | Yes | More precisely, C(i) is satisfied if β(i) is part of the top-ρ relevance scores when compared to all previously observed hidden states, where ρ is a fixed hyper-parameter satisfying ρ ≥ \|R_t\| for all t. Thus the choices of ν and ρ not only directly impact computational complexity and gradient propagation, but also indirectly influence gradient propagation via the implicit effect of κ = ν + ρ on d, as already discussed in Section 3. (A sketch of this screening step follows the table.) |
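
The Experiment Setup excerpt above describes the paper's relevancy screening: each past hidden state i receives a relevance score β(i), only the top-ρ scoring states are retained in the relevant set R_t (so ρ ≥ |R_t|), and attention is restricted to those states plus the ν most recent ones, giving at most κ = ν + ρ attended states per step. The Python sketch below illustrates only that bookkeeping; how β(i) is computed is not reproduced here, and the function names and dictionary-based interface are illustrative assumptions rather than the authors' implementation.

```python
def relevancy_screen(scores, rho):
    """Return the indices of the top-`rho` relevance scores.

    `scores` maps each past time step i to its relevance score beta(i).
    Only the rho highest-scoring hidden states survive screening, so the
    retained relevant set R_t always satisfies |R_t| <= rho.
    """
    if len(scores) <= rho:
        return sorted(scores)                      # nothing to discard yet
    ranked = sorted(scores, key=scores.get, reverse=True)
    return sorted(ranked[:rho])                    # indices kept in R_t


def attention_support(t, relevant, nu):
    """Time steps attention may attend over at step t: the `nu` most recent
    hidden states plus the screened relevant set, i.e. at most
    kappa = nu + rho states per step (linear rather than quadratic in T).
    """
    recent = range(max(0, t - nu), t)
    return sorted(set(recent) | set(relevant))
```

For example, with ρ = 2 and ν = 2, `relevancy_screen({0: 0.9, 1: 0.1, 3: 0.7}, rho=2)` returns `[0, 3]`, and `attention_support(6, [0, 3], nu=2)` returns `[0, 3, 4, 5]`.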
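
The Copy and Denoise tasks quoted in the Open Datasets row are synthetic, so they can be regenerated rather than downloaded. The sketch below follows the usual conventions of Arjovsky et al. [2] and Jing et al. [21] (eight data symbols, a blank token, a delimiter token, ten symbols to copy); the exact sequence lengths and padding conventions are assumptions made for illustration, not details confirmed by the paper.

```python
import numpy as np

def copy_task_batch(batch_size, T, n_symbols=8, copy_len=10, rng=None):
    """Copy task: `copy_len` random symbols appear in the first `copy_len`
    steps, followed by blanks, a delimiter, and more blanks; the target is
    blank everywhere except the last `copy_len` steps, where the original
    symbols must be reproduced.
    Tokens: 0 = blank, 1..n_symbols = data, n_symbols + 1 = delimiter.
    """
    rng = rng or np.random.default_rng()
    seq_len = T + 2 * copy_len
    blank, delim = 0, n_symbols + 1

    data = rng.integers(1, n_symbols + 1, size=(batch_size, copy_len))
    x = np.full((batch_size, seq_len), blank, dtype=np.int64)
    x[:, :copy_len] = data
    x[:, copy_len + T - 1] = delim            # signal that copying should start
    y = np.full((batch_size, seq_len), blank, dtype=np.int64)
    y[:, -copy_len:] = data
    return x, y


def denoise_task_batch(batch_size, T, n_symbols=8, copy_len=10, rng=None):
    """Denoise variant: the same `copy_len` symbols are scattered at random
    positions within the first T steps instead of occupying the first
    `copy_len` steps; the model must pick them out and reproduce them after
    the delimiter. Requires copy_len <= T.
    """
    rng = rng or np.random.default_rng()
    seq_len = T + copy_len + 1
    blank, delim = 0, n_symbols + 1

    x = np.full((batch_size, seq_len), blank, dtype=np.int64)
    y = np.full((batch_size, seq_len), blank, dtype=np.int64)
    for b in range(batch_size):
        data = rng.integers(1, n_symbols + 1, size=copy_len)
        positions = np.sort(rng.choice(T, size=copy_len, replace=False))
        x[b, positions] = data
        y[b, -copy_len:] = data
    x[:, T] = delim                           # delimiter after the noisy prefix
    return x, y
```

For instance, `copy_task_batch(32, T=100)` yields integer input/target arrays of shape (32, 120). pMNIST and PTB, by contrast, are standard public datasets and are not sketched here.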