Monotonic Location Attention for Length Generalization
Authors: Jishnu Ray Chowdhury, Cornelia Caragea
ICML 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We explore different ways to utilize position-based cross-attention in seq2seq networks to enable length generalization in algorithmic tasks. We show that a simple approach of interpolating the original and reversed encoded representations combined with relative attention allows near-perfect length generalization for both forward and reverse lookup tasks or copy tasks that had been generally hard to tackle. We also devise harder diagnostic tasks where the relative distance of the ideal attention position varies with timestep. In such settings, the simple interpolation trick with relative attention is not sufficient. We introduce novel variants of location attention building on top of Dubois et al. (2020) to address the new diagnostic tasks. We also show the benefits of our approaches for length generalization in SCAN (Lake & Baroni, 2018) and CFQ (Keysers et al., 2020). (A hedged sketch of the interpolation idea appears below the table.) |
| Researcher Affiliation | Academia | 1Computer Science, University of Illinois Chicago. Correspondence to: Jishnu Ray Chowdhury <jraych2@uic.edu>, Cornelia Caragea <cornelia@uic.edu>. |
| Pseudocode | No | The paper describes its methods verbally and mathematically but does not include any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our code is available on GitHub: https://github.com/JRC1995/Monotonic-Location-Attention |
| Open Datasets | Yes | To achieve the above desideratum and evaluate length generalization capability of different interlayer attention mechanisms, we set up ten synthetic probing tasks (see Table 1 and 2). Following prior work (Graves et al., 2014; Dehghani et al., 2019; Liang et al., 2021), we first consider the task of simply copying source texts... Following Dubois et al. (2020), we also consider the compositional lookup table task (Liska et al., 2018)... We also show that our models maintain comparable performance in the SCAN (Lake & Baroni, 2018) and CFQ (Keysers et al., 2020) length splits. |
| Dataset Splits | Yes | For the development set, we generated 2,000 samples of sequence length 10-15. ... The development split consists of about 500 samples of sequence length 6 and approximately 500 samples of length 7. |
| Hardware Specification | No | The paper mentions general architectural components like GRUs and attention mechanisms but does not specify any particular hardware (e.g., GPU models, CPU models, or cloud computing instances) used for the experiments. |
| Software Dependencies | No | The paper mentions using a "Bidirectional GRU based seq2seq model" and "Adam (default parameters)" but does not provide specific version numbers for any software libraries, frameworks, or environments. |
| Experiment Setup | Yes | We use 64 as the embedding size (i.e., d_e = 64) and single-layered GRUs for the encoder/decoder. The total hidden size for the encoder/decoder GRU is 128 (therefore d = 128). We only use one head for the attention mechanism. We use a dropout of 50% on the encodings similar to Dubois et al. (2020). We set β = 5. ... For CFQ, we use two-layered GRUs for the encoder/decoder and twice the hidden size/embedding size used above. Generally, we use a batch size of 32, a learning rate of 1e-3 with Adam (default parameters) and no weight decay. We halve the learning rate if the accuracy plateaus for four contiguous epochs. We run the models for a maximum of 100 epochs with a patience of 50 for early stopping. (A training-configuration sketch appears below the table.) |
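
The interpolation idea quoted in the Research Type row can be illustrated with a short sketch. This is a minimal illustration under assumptions, not the paper's implementation: the sigmoid gate conditioned on the decoder state, the module name `InterpolatedEncodings`, and the tensor shapes are hypothetical choices; the paper's exact gating and relative-attention formulation live in the released code.

```python
import torch
import torch.nn as nn


class InterpolatedEncodings(nn.Module):
    """Sketch: mix original and sequence-reversed encoder states.

    The learned sigmoid gate and its conditioning on the decoder state
    are assumptions for illustration only.
    """

    def __init__(self, hidden_size: int = 128):
        super().__init__()
        self.gate = nn.Linear(hidden_size, 1)  # hypothetical gating layer

    def forward(self, enc: torch.Tensor, dec_state: torch.Tensor) -> torch.Tensor:
        # enc: (batch, src_len, hidden); dec_state: (batch, hidden)
        enc_rev = torch.flip(enc, dims=[1])          # reversed encodings
        g = torch.sigmoid(self.gate(dec_state))      # (batch, 1) mixing weight
        g = g.unsqueeze(1)                           # (batch, 1, 1) for broadcasting
        # Convex combination of forward and reversed representations; the
        # result would then feed a relative-position cross-attention layer.
        return g * enc + (1.0 - g) * enc_rev
```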
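
The Experiment Setup row can likewise be read as a configuration sketch. The values below (embedding size 64, hidden size 128, 50% dropout, Adam at 1e-3 with no weight decay, batch size 32, halving the learning rate after four plateaued epochs, at most 100 epochs with early-stopping patience 50) come from the quoted text; the class name `Seq2SeqGRU`, the placeholder vocabulary size, and the use of PyTorch's `ReduceLROnPlateau` are assumptions for illustration.

```python
import torch
import torch.nn as nn

# Hyperparameters reported in the paper (synthetic-task setting).
EMBED_SIZE = 64      # d_e
HIDDEN_SIZE = 128    # total encoder/decoder hidden size d
DROPOUT = 0.5
BATCH_SIZE = 32
LR = 1e-3
MAX_EPOCHS = 100
EARLY_STOP_PATIENCE = 50


class Seq2SeqGRU(nn.Module):
    """Single-layer bidirectional-GRU encoder with a GRU decoder (partial sketch;
    the cross-attention and decoding loop are omitted)."""

    def __init__(self, vocab_size: int):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, EMBED_SIZE)
        self.drop = nn.Dropout(DROPOUT)
        # Two directions of HIDDEN_SIZE // 2 each give a total hidden size d = 128.
        self.encoder = nn.GRU(EMBED_SIZE, HIDDEN_SIZE // 2,
                              batch_first=True, bidirectional=True)
        self.decoder = nn.GRU(EMBED_SIZE, HIDDEN_SIZE, batch_first=True)
        self.out = nn.Linear(HIDDEN_SIZE, vocab_size)


model = Seq2SeqGRU(vocab_size=100)  # vocabulary size is task-dependent (placeholder)
optimizer = torch.optim.Adam(model.parameters(), lr=LR, weight_decay=0.0)
# Halve the learning rate when dev accuracy plateaus for four contiguous epochs.
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="max", factor=0.5, patience=4)
```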