Improving Transformers with Probabilistic Attention Keys
Authors: Tam Minh Nguyen, Tan Minh Nguyen, Dung D. D. Le, Duy Khuong Nguyen, Viet-Anh Tran, Richard Baraniuk, Nhat Ho, Stanley Osher
ICML 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section, we numerically justify the efficiency of Transformer-MGK/MLK and empirically study the advantage of using mixture of keys on various benchmarks, including different tasks in the Long Range Arena (LRA) (Section 3.1) and language modeling on Wikitext-103 (Merity et al., 2017) (Section 3.2). |
| Researcher Affiliation | Collaboration | 1 FPT Software AI Center, Ha Noi, Vietnam; 2 Department of Mathematics, University of California, Los Angeles, USA; 3 College of Engineering and Computer Science, VinUniversity, Ha Noi, Vietnam; 4 Deezer Research, France; 5 Department of Electrical and Computer Engineering, Rice University, Houston, USA; 6 Department of Statistics and Data Sciences, The University of Texas at Austin, USA. |
| Pseudocode | No | The paper describes algorithmic steps in prose but does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our PyTorch (Paszke et al., 2019) code is available at https://github.com/minhtannguyen/transformer-mgk. |
| Open Datasets | Yes | We consider the following tasks in the LRA benchmark: Listops (Nangia & Bowman, 2018), byte-level IMDb reviews text classification (Maas et al., 2011), and byte-level document retrieval (Radev et al., 2013). ... We consider the word-level language modeling task on WikiText-103 (Merity et al., 2017) ... We further examine the advantages of our methods on the IWSLT'14 German-English machine translation task (Cettolo et al., 2014). |
| Dataset Splits | Yes | The validation and test sets are composed of 218K and 246K running words, respectively. Each of them contains 60 articles and about 268K words. (WikiText-103) ... The IWSLT14 German-English dataset (Cettolo et al., 2014) contains 153K training, 7K validation, and 7K test TED-talks scripts German-English translated sentences. |
| Hardware Specification | Yes | All our experiments are conducted on a server with 4 NVIDIA A100 GPUs. |
| Software Dependencies | No | The paper mentions PyTorch and other codebases but does not provide specific version numbers for software dependencies. |
| Experiment Setup | Yes | All models have 2 layers, 64 embedding dimension, and 128 hidden dimension. The number of heads in each layer is set to 1, 2, 4, and 8. For Transformer-MGK/MLKs and their shifted versions, we share π_jr for all positions j and learn it for each head. The initial value for each π_jr is set to 0.5. ... for small models, we set the key, value, and query dimension to 128, and the training and evaluation context length to 256. ... We train our models for language modeling on 2 A100s, 40GB each, with a batch size of 96, and each model is trained for 120 epochs. We apply 10% dropout ... and use the Adam optimizer ... with an initial learning rate of 0.00025 and 2000 steps for learning rate warm-up. |
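
The setup quoted above revolves around attention heads that score each query against a small mixture of keys per position, with mixture weights π_jr shared across positions, learned per head, and initialized to 0.5. The PyTorch sketch below shows one way such a mixture-of-Gaussian-keys head could be wired up under those settings (64-dimensional embeddings, 128-dimensional heads, two keys per position); the class name, the fixed Gaussian bandwidth, and the normalization details are illustrative assumptions, not the authors' released implementation, which lives in the linked repository.

```python
import torch
import torch.nn as nn


class MixtureOfKeysAttentionHead(nn.Module):
    """Single attention head scoring each query against M Gaussian keys per
    position, with mixture weights shared across positions and learned per head
    (initialized to 0.5, as in the quoted setup). Illustrative sketch only."""

    def __init__(self, embed_dim: int = 64, head_dim: int = 128, num_keys: int = 2):
        super().__init__()
        self.head_dim = head_dim
        self.num_keys = num_keys
        self.q_proj = nn.Linear(embed_dim, head_dim)
        self.k_proj = nn.Linear(embed_dim, head_dim * num_keys)  # M keys per position
        self.v_proj = nn.Linear(embed_dim, head_dim)
        # pi_r: one weight per mixture component, shared over all positions j.
        self.pi = nn.Parameter(torch.full((num_keys,), 0.5))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        bsz, seq_len, _ = x.shape
        q = self.q_proj(x)                                        # (B, N, D)
        k = self.k_proj(x).view(bsz, seq_len, self.num_keys, -1)  # (B, N, M, D)
        v = self.v_proj(x)                                        # (B, N, D)
        # Squared distance between every query i and every key component (j, r).
        diff = q[:, :, None, None, :] - k[:, None, :, :, :]       # (B, N, N, M, D)
        dist2 = diff.pow(2).sum(dim=-1)                           # (B, N, N, M)
        # Gaussian kernel with a fixed bandwidth (an assumption), mixed with pi.
        pi = torch.softmax(self.pi, dim=-1)
        scores = (pi * torch.exp(-dist2 / (2.0 * self.head_dim))).sum(dim=-1)
        attn = scores / (scores.sum(dim=-1, keepdim=True) + 1e-9)  # normalize over j
        return attn @ v                                            # (B, N, D)
```

The quoted training recipe (Adam with an initial learning rate of 0.00025 and 2000 warm-up steps, 10% dropout) can likewise be sketched with standard PyTorch utilities; the linear warm-up shape and the constant rate after warm-up are assumptions, since the quote does not specify the post-warm-up schedule.

```python
model = MixtureOfKeysAttentionHead()
optimizer = torch.optim.Adam(model.parameters(), lr=0.00025)
warmup_steps = 2000
# Linear warm-up to the base learning rate over the first 2000 optimizer steps;
# call scheduler.step() once per training step.
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lambda step: min(1.0, (step + 1) / warmup_steps))
```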