Untangling tradeoffs between recurrence and self-attention in artificial neural networks
Authors: Giancarlo Kerg, Bhargav Kanuparthi, Anirudh Goyal, Kyle Goyette, Yoshua Bengio, Guillaume Lajoie
NeurIPS 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Using simple tasks for their ease of interpretation, and their variety of computational demands, we illustrate the efficacy of this approach in numerical experiments. The remainder of this paper is as follows. In Section 2, we give a brief outline of related cognitive processes and neural network mechanisms. In Section 3, we present our central results: asymptotic guarantees for gradient propagation in self-attentive recurrent networks. To illustrate how to exploit these guarantees, in Section 4, we showcase a simple relevancy screening mechanism that aims to efficiently consolidate relevant memory, reducing the size of the computational graph from quadratic to linear in sequence length. Finally, in Section 5, we compare various recurrent and attention models with our proposed relevancy screening mechanism on a series of simple numerical experiments, while, in Section 6, we analyze their gradient propagation properties together with their GPU usage. |
| Researcher Affiliation | Academia | 1: Mila Quebec AI Institute, Canada 2: Université de Montréal, Département d'Informatique et de Recherche Opérationnelle, Montreal, Canada 3: Université de Montréal, CIRRELT, Montreal, Canada 4: CIFAR Senior Fellow 5: Université de Montréal, Département de Mathématiques et Statistiques, Montreal, Canada |
| Pseudocode | Yes | Algorithm 1 Relevancy Screening |
| Open Source Code | No | The paper does not provide any explicit statements about releasing source code or links to a code repository for the described methodology. |
| Open Datasets | Yes | Copy task [19]: The characters to be copied are presented in the first 10 time steps, then must be outputted after a long delay of T time steps (see full description in Arjovsky et al. [2]). Denoise task (Jing et al. [21]): This generalizes the Copy task, as the symbols that need to be copied are now randomly distributed among the T time steps, requiring the model to selectively pick the inputs that need to be copied. Here, we perform tests on pMNIST [24], a variant of MNIST [25] where pixels are fed sequentially in a permuted order to the network, as well as the character-level Penn Treebank corpus (PTB) [27], where the next letter in a text needs to be predicted. (A sketch of synthetic generators for the Copy and Denoise tasks follows the table.) |
| Dataset Splits | No | The paper does not explicitly provide specific percentages, counts, or a detailed methodology for splitting datasets into training, validation, and test sets. It mentions training on a specific task and then evaluating on variations (e.g., Transfer Copy task trained on T=100 and evaluated for T > 100), but not general data splits for model training. |
| Hardware Specification | Yes | All the models were run using a NVIDIA Titan XP GPU and their peak usage was recorded in order to quantify the amount of computational resources used for each of them. |
| Software Dependencies | No | The paper does not provide specific software dependencies with version numbers (e.g., Python 3.x, PyTorch 1.x). It mentions external models or frameworks in citations, but not its own software environment for reproducibility. |
| Experiment Setup | Yes | More precisely, C(i) is satisfied if β(i) is part of the top-ρ relevance scores when compared to all previously observed hidden states, where ρ is a fixed hyper-parameter satisfying ρ ≥ \|R_t\| for all t. Thus the choices of ν and ρ not only directly impact computational complexity and gradient propagation, but also indirectly influence gradient propagation via the implicit effect of κ = ν + ρ on d, as already discussed in Section 3. (A sketch of this screening step follows the table.) |
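
The Experiment Setup excerpt above describes the paper's relevancy screening: each past hidden state i receives a relevance score β(i), only the top-ρ scoring states are retained in the relevant set R_t (so ρ ≥ |R_t|), and attention is restricted to those states plus the ν most recent ones, giving at most κ = ν + ρ attended states per step. The Python sketch below illustrates only that bookkeeping; how β(i) is computed is not reproduced here, and the function names and dictionary-based interface are illustrative assumptions rather than the authors' implementation.

```python
def relevancy_screen(scores, rho):
    """Return the indices of the top-`rho` relevance scores.

    `scores` maps each past time step i to its relevance score beta(i).
    Only the rho highest-scoring hidden states survive screening, so the
    retained relevant set R_t always satisfies |R_t| <= rho.
    """
    if len(scores) <= rho:
        return sorted(scores)                      # nothing to discard yet
    ranked = sorted(scores, key=scores.get, reverse=True)
    return sorted(ranked[:rho])                    # indices kept in R_t


def attention_support(t, relevant, nu):
    """Time steps attention may attend over at step t: the `nu` most recent
    hidden states plus the screened relevant set, i.e. at most
    kappa = nu + rho states per step (linear rather than quadratic in T).
    """
    recent = range(max(0, t - nu), t)
    return sorted(set(recent) | set(relevant))
```

For example, with ρ = 2 and ν = 2, `relevancy_screen({0: 0.9, 1: 0.1, 3: 0.7}, rho=2)` returns `[0, 3]`, and `attention_support(6, [0, 3], nu=2)` returns `[0, 3, 4, 5]`.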
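
The Copy and Denoise tasks quoted in the Open Datasets row are synthetic, so they can be regenerated rather than downloaded. The sketch below follows the usual conventions of Arjovsky et al. [2] and Jing et al. [21] (eight data symbols, a blank token, a delimiter token, ten symbols to copy); the exact sequence lengths and padding conventions are assumptions made for illustration, not details confirmed by the paper.

```python
import numpy as np

def copy_task_batch(batch_size, T, n_symbols=8, copy_len=10, rng=None):
    """Copy task: `copy_len` random symbols appear in the first `copy_len`
    steps, followed by blanks, a delimiter, and more blanks; the target is
    blank everywhere except the last `copy_len` steps, where the original
    symbols must be reproduced.
    Tokens: 0 = blank, 1..n_symbols = data, n_symbols + 1 = delimiter.
    """
    rng = rng or np.random.default_rng()
    seq_len = T + 2 * copy_len
    blank, delim = 0, n_symbols + 1

    data = rng.integers(1, n_symbols + 1, size=(batch_size, copy_len))
    x = np.full((batch_size, seq_len), blank, dtype=np.int64)
    x[:, :copy_len] = data
    x[:, copy_len + T - 1] = delim            # signal that copying should start
    y = np.full((batch_size, seq_len), blank, dtype=np.int64)
    y[:, -copy_len:] = data
    return x, y


def denoise_task_batch(batch_size, T, n_symbols=8, copy_len=10, rng=None):
    """Denoise variant: the same `copy_len` symbols are scattered at random
    positions within the first T steps instead of occupying the first
    `copy_len` steps; the model must pick them out and reproduce them after
    the delimiter. Requires copy_len <= T.
    """
    rng = rng or np.random.default_rng()
    seq_len = T + copy_len + 1
    blank, delim = 0, n_symbols + 1

    x = np.full((batch_size, seq_len), blank, dtype=np.int64)
    y = np.full((batch_size, seq_len), blank, dtype=np.int64)
    for b in range(batch_size):
        data = rng.integers(1, n_symbols + 1, size=copy_len)
        positions = np.sort(rng.choice(T, size=copy_len, replace=False))
        x[b, positions] = data
        y[b, -copy_len:] = data
    x[:, T] = delim                           # delimiter after the noisy prefix
    return x, y
```

For instance, `copy_task_batch(32, T=100)` yields integer input/target arrays of shape (32, 120). pMNIST and PTB, by contrast, are standard public datasets and are not sketched here.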