DeLighT: Deep and Light-weight Transformer
Authors: Sachin Mehta, Marjan Ghazvininejad, Srinivasan Iyer, Luke Zettlemoyer, Hannaneh Hajishirzi
ICLR 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on benchmark machine translation and language modeling tasks show that DeLighT matches or improves the performance of baseline Transformers with 2 to 3 times fewer parameters on average. |
| Researcher Affiliation | Collaboration | Sachin Mehta (University of Washington), Marjan Ghazvininejad (Facebook AI Research), Srinivasan Iyer (Facebook AI Research), Luke Zettlemoyer (University of Washington, Facebook AI Research), and Hannaneh Hajishirzi (University of Washington, Allen Institute for AI) |
| Pseudocode | Yes | Listing 1: "Naive implementation of GLT in Pytorch" and Listing 2: "Grouping kernel in CUDA" are provided in Appendix E (see the illustrative GLT sketch below the table). |
| Open Source Code | Yes | Our source code is open-source and is available at: https://github.com/sacmehta/delight |
| Open Datasets | Yes | We benchmark DeLighT models on four datasets: (1) IWSLT 14 German-English (De-En), (2) WMT 16 English-Romanian (En-Ro), (3) WMT 14 English-German (WMT 14 En-De), and (4) WMT 14 English-French (WMT 14 En-Fr)... We evaluate on the WikiText-103 dataset (Merity et al., 2017) that has 103M/217K/245K tokens for training, validation, and testing. |
| Dataset Splits | Yes | For the IWSLT 14 De-En dataset, we replicate the setup of Wu et al. (2019) and Edunov et al. (2018), which uses 160K/7K/7K sentence pairs for training, validation, and testing with a joint BPE vocabulary of about 10K tokens, respectively. |
| Hardware Specification | Yes | We train all our models for 50K iterations with a batch size of 4K tokens on a single NVIDIA GTX 1080 GPU. |
| Software Dependencies | No | We implement our models using Fairseq (Ott et al., 2019) and use their provided scripts for data pre-processing, training, and evaluation. The dedicated CUDA kernel provided by the APEX library is enabled for multi-head attention in Transformers. No specific version numbers are provided for these software components. |
| Experiment Setup | Yes | For IWSLT 14 De-En models, we follow the setup of Wu et al. (2019) and train all our models for 50K iterations with a batch size of 4K tokens on a single NVIDIA GTX 1080 GPU... We use Adam (Kingma and Ba, 2015) to minimize cross entropy loss with a label smoothing value of 0.1 during training. (See the loss/optimizer sketch below the table.) |
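For reference, the following is a minimal PyTorch sketch of the grouped linear transformation (GLT) idea behind Listing 1 cited in the Pseudocode row. It is not the authors' code (that appears in Appendix E and the open-source repository); the class name, parameter names, and toy sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn


class NaiveGLT(nn.Module):
    """Naive grouped linear transformation: split the input features into
    `groups` chunks, apply an independent linear layer to each chunk, and
    concatenate the per-group outputs. Hypothetical sketch, not the paper's
    Listing 1."""

    def __init__(self, in_features: int, out_features: int, groups: int):
        super().__init__()
        assert in_features % groups == 0 and out_features % groups == 0
        self.groups = groups
        self.layers = nn.ModuleList(
            [nn.Linear(in_features // groups, out_features // groups) for _ in range(groups)]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, in_features) -> g chunks of shape (batch, in_features // g)
        chunks = torch.chunk(x, self.groups, dim=-1)
        outputs = [layer(chunk) for layer, chunk in zip(self.layers, chunks)]
        return torch.cat(outputs, dim=-1)  # (batch, out_features)


if __name__ == "__main__":
    glt = NaiveGLT(in_features=128, out_features=256, groups=4)
    print(glt(torch.randn(8, 128)).shape)  # torch.Size([8, 256])
```

A loop over per-group `nn.Linear` layers is the straightforward (naive) formulation; the paper's Listing 2 addresses its inefficiency with a dedicated CUDA grouping kernel.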
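Similarly, the objective quoted in the Experiment Setup row (Adam minimizing cross entropy with a label smoothing value of 0.1) can be sketched in plain PyTorch as below. The toy model, tensor shapes, and default Adam hyper-parameters are assumptions for illustration only; the authors train with Fairseq's standard pipeline, and `label_smoothing` in `nn.CrossEntropyLoss` requires PyTorch 1.10 or later.

```python
import torch
import torch.nn as nn

# Toy stand-in for a model's output projection; sizes are illustrative only.
vocab_size, hidden = 10_000, 64
model = nn.Linear(hidden, vocab_size)

# Adam with default hyper-parameters; the paper's learning-rate schedule
# (handled by Fairseq) is not reproduced here.
optimizer = torch.optim.Adam(model.parameters())

# Cross entropy with label smoothing of 0.1, as described in the setup.
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)

features = torch.randn(32, hidden)             # fake decoder states
targets = torch.randint(0, vocab_size, (32,))  # fake target token ids

# One illustrative optimization step.
optimizer.zero_grad()
loss = criterion(model(features), targets)
loss.backward()
optimizer.step()
print(f"label-smoothed cross entropy: {loss.item():.3f}")
```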