DeLighT: Deep and Light-weight Transformer

Authors: Sachin Mehta, Marjan Ghazvininejad, Srinivasan Iyer, Luke Zettlemoyer, Hannaneh Hajishirzi

ICLR 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments on benchmark machine translation and language modeling tasks show that DeLighT matches or improves the performance of baseline Transformers with 2 to 3 times fewer parameters on average.
Researcher Affiliation | Collaboration | Sachin Mehta (1), Marjan Ghazvininejad (2), Srinivasan Iyer (2), Luke Zettlemoyer (1, 2), and Hannaneh Hajishirzi (1, 3); 1: University of Washington, 2: Facebook AI Research, 3: Allen Institute for AI
Pseudocode | Yes | Listing 1 ("Naive implementation of GLT in PyTorch") and Listing 2 ("Grouping kernel in CUDA") are provided in Appendix E.
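For readers without access to Appendix E, the grouped linear transformation (GLT) referenced above can be sketched in a few lines of PyTorch. The class name, initialization, and einsum-based grouping below are assumptions of this illustration, not the authors' listing: the idea is simply to split the features into groups and apply an independent linear projection per group.

```python
import torch
import torch.nn as nn

class GroupLinear(nn.Module):
    """Minimal sketch of a grouped linear transform (GLT).

    Splits the input features into `groups` chunks, applies an independent
    linear projection to each chunk, and concatenates the results. This is
    a hypothetical re-implementation for illustration; see Listing 1 in
    Appendix E of the paper for the authors' version.
    """

    def __init__(self, in_features: int, out_features: int, groups: int = 4):
        super().__init__()
        assert in_features % groups == 0 and out_features % groups == 0
        self.groups = groups
        # One weight matrix per group: (groups, in/g, out/g)
        self.weight = nn.Parameter(
            torch.empty(groups, in_features // groups, out_features // groups)
        )
        nn.init.xavier_uniform_(self.weight)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, in_features)
        b = x.size(0)
        x = x.reshape(b, self.groups, -1)                  # (B, g, in/g)
        x = torch.einsum("bgi,gio->bgo", x, self.weight)   # per-group projection
        return x.reshape(b, -1)                            # (B, out_features)

# Quick shape check
y = GroupLinear(128, 256, groups=4)(torch.randn(2, 128))
print(y.shape)  # torch.Size([2, 256])
```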
Open Source Code | Yes | Our source code is open-source and is available at: https://github.com/sacmehta/delight
Open Datasets | Yes | We benchmark DeLighT models on four datasets: (1) IWSLT'14 German-English (De-En), (2) WMT'16 English-Romanian (En-Ro), (3) WMT'14 English-German (WMT'14 En-De), and (4) WMT'14 English-French (WMT'14 En-Fr)... We evaluate on the WikiText-103 dataset (Merity et al., 2017) that has 103M/217K/245K tokens for training, validation, and testing.
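The paper preprocesses these corpora with Fairseq's provided scripts; purely as a hedged illustration of how one might fetch and sanity-check the language modeling corpus, the sketch below assumes the Hugging Face datasets hub mirror of WikiText-103 (the "wikitext" / "wikitext-103-raw-v1" names are assumptions about that mirror, not part of the authors' pipeline).

```python
# Sketch: loading WikiText-103 via the Hugging Face `datasets` hub and
# roughly verifying the split sizes quoted above. This is NOT the authors'
# preprocessing pipeline (they use Fairseq scripts); it is only one way to
# obtain the raw corpus for inspection.
from datasets import load_dataset

wikitext = load_dataset("wikitext", "wikitext-103-raw-v1")
for split in ("train", "validation", "test"):
    # Counting the training split takes a little while.
    n_tokens = sum(len(line.split()) for line in wikitext[split]["text"])
    print(split, f"{n_tokens / 1e6:.1f}M whitespace tokens")
# Expect roughly 103M / 0.2M / 0.2M tokens, in line with the splits cited above.
```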
Dataset Splits | Yes | For the IWSLT'14 De-En dataset, we replicate the setup of Wu et al. (2019) and Edunov et al. (2018), which uses 160K/7K/7K sentence pairs for training, validation, and testing, respectively, with a joint BPE vocabulary of about 10K tokens.
Hardware Specification | Yes | We train all our models for 50K iterations with a batch size of 4K tokens on a single NVIDIA GTX 1080 GPU.
Software Dependencies | No | We implement our models using Fairseq (Ott et al., 2019) and use their provided scripts for data pre-processing, training, and evaluation. We also enabled the dedicated CUDA kernel provided by the APEX library for multi-head attention in Transformers. No specific version numbers are provided for these software components.
Experiment Setup | Yes | For IWSLT'14 De-En models, we follow the setup of Wu et al. (2019) and train all our models for 50K iterations with a batch size of 4K tokens on a single NVIDIA GTX 1080 GPU... We use Adam (Kingma and Ba, 2015) to minimize cross-entropy loss with a label smoothing value of 0.1 during training.
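The optimization recipe quoted above (Adam minimizing cross-entropy with label smoothing of 0.1) is a standard one. Below is a minimal PyTorch sketch of that loss/optimizer combination; the toy model, vocabulary size, learning rate, and batch are placeholders, and the snippet stands in for Fairseq's label-smoothed cross-entropy criterion rather than reproducing the authors' configuration.

```python
import torch
import torch.nn as nn

# Minimal sketch of the training objective described above: Adam minimizing
# cross-entropy with label smoothing 0.1. The model, vocabulary size, and
# toy batch are placeholders, not the authors' Fairseq setup.
vocab_size, pad_idx = 10_000, 1
model = nn.Sequential(nn.Embedding(vocab_size, 512), nn.Linear(512, vocab_size))

criterion = nn.CrossEntropyLoss(label_smoothing=0.1, ignore_index=pad_idx)
optimizer = torch.optim.Adam(model.parameters(), lr=5e-4, betas=(0.9, 0.98))

tokens = torch.randint(2, vocab_size, (4, 32))    # toy (batch, seq) input
targets = torch.randint(2, vocab_size, (4, 32))   # toy next-token targets

logits = model(tokens)                            # (batch, seq, vocab)
loss = criterion(logits.reshape(-1, vocab_size), targets.reshape(-1))
loss.backward()
optimizer.step()
```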