Understanding How Encoder-Decoder Architectures Attend
Authors: Kyle Aitken, Vinay Ramasesh, Yuan Cao, Niru Maheswaranathan
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this work, we investigate how encoder-decoder networks solve different sequence-to-sequence tasks. We introduce a way of decomposing hidden states over a sequence into temporal (independent of input) and input-driven (independent of sequence position) components. This reveals how attention matrices are formed: depending on the task requirements, networks rely more heavily on either the temporal or input-driven components. These findings hold across both recurrent and feed-forward architectures despite their differences in forming the temporal components. Overall, our results provide new insight into the inner workings of attention-based encoder-decoder networks. (An illustrative sketch of this decomposition appears after the table.) |
| Researcher Affiliation | Collaboration | Kyle Aitken, Department of Physics, University of Washington, Seattle, Washington, USA (kaitken17@gmail.com); Vinay V Ramasesh, Google Research, Blueshift Team, Mountain View, California, USA; Yuan Cao, Google, Inc., Mountain View, California, USA; Niru Maheswaranathan, Google Research, Brain Team, Mountain View, California, USA |
| Pseudocode | No | The paper does not contain any sections, figures, or blocks explicitly labeled as 'Pseudocode' or 'Algorithm'. |
| Open Source Code | No | The checklist section of the paper explicitly states: 'Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [No]' |
| Open Datasets | Yes | We train the AED and AO architectures on this natural language task using a subset of the para_crawl dataset Bañón et al. (2020) consisting of over 30 million parallel sentences. |
| Dataset Splits | No | The checklist section of the paper explicitly states: 'Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [No]'. The paper mentions using a 'test set of size M' for estimating components, but does not provide details about train/validation/test splits, percentages, or counts. |
| Hardware Specification | No | The paper does not specify any particular GPU models, CPU types, or other hardware components used for running the experiments. The checklist section explicitly states: 'Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [No]' |
| Software Dependencies | No | The paper mentions types of RNN cells (LSTMs, GRUs, UGRNNs) but does not provide specific software library names or version numbers (e.g., TensorFlow version, PyTorch version, Python version, specific solver versions) needed to reproduce the experiments. |
| Experiment Setup | No | The paper does not provide specific details about experimental setup, such as hyperparameter values (e.g., learning rate, batch size, number of epochs) or optimizer settings. The checklist section explicitly states: 'Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [No]' |
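
The hidden-state decomposition quoted in the Research Type row can be summarized in a few lines: averaging the hidden states over a test set of M sequences at each position yields the temporal component (independent of the particular input), and the per-sequence residual yields the input-driven component. The sketch below is an illustration based only on the abstract's description and the paper's mention of a "test set of size M", not the authors' implementation (no code was released); the function name, array shapes, and use of NumPy are our assumptions.

```python
import numpy as np

def decompose_hidden_states(h):
    """Illustrative decomposition of hidden states (assumed shapes).

    h : array of shape (M, T, D) -- hidden states for M test sequences,
        T sequence positions, and D hidden units.

    Returns
    -------
    temporal     : (T, D) array, the per-position mean over the M inputs
                   (independent of the particular input sequence).
    input_driven : (M, T, D) array, the residual h - temporal
                   (what remains after removing the shared temporal part).
    """
    temporal = h.mean(axis=0)                 # average over inputs at each position
    input_driven = h - temporal[np.newaxis]   # per-sequence deviation from that average
    return temporal, input_driven

# Hypothetical usage on random stand-in data:
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    h = rng.normal(size=(128, 20, 64))         # M=128 sequences, T=20 steps, D=64 units
    temporal, input_driven = decompose_hidden_states(h)
    print(temporal.shape, input_driven.shape)  # (20, 64) (128, 20, 64)
```
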