DeLighT: Deep and Light-weight Transformer

Authors: Sachin Mehta, Marjan Ghazvininejad, Srinivasan Iyer, Luke Zettlemoyer, Hannaneh Hajishirzi

ICLR 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments on benchmark machine translation and language modeling tasks show that DeLighT matches or improves the performance of baseline Transformers with 2 to 3 times fewer parameters on average.
Researcher Affiliation | Collaboration | Sachin Mehta (1), Marjan Ghazvininejad (2), Srinivasan Iyer (2), Luke Zettlemoyer (1, 2), and Hannaneh Hajishirzi (1, 3); 1: University of Washington, 2: Facebook AI Research, 3: Allen Institute for AI
Pseudocode | Yes | Listing 1 ("Naive implementation of GLT in PyTorch") and Listing 2 ("Grouping kernel in CUDA") are provided in Appendix E.
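For readers without access to Appendix E, the grouped linear transformation (GLT) referenced above can be sketched in a few lines of PyTorch. The class name, initialization, and einsum-based grouping below are assumptions of this illustration, not the authors' listing: the idea is simply to split the features into groups and apply an independent linear projection per group.

```python
import torch
import torch.nn as nn

class GroupLinear(nn.Module):
    """Minimal sketch of a grouped linear transform (GLT).

    Splits the input features into `groups` chunks, applies an independent
    linear projection to each chunk, and concatenates the results. This is
    a hypothetical re-implementation for illustration; see Listing 1 in
    Appendix E of the paper for the authors' version.
    """

    def __init__(self, in_features: int, out_features: int, groups: int = 4):
        super().__init__()
        assert in_features % groups == 0 and out_features % groups == 0
        self.groups = groups
        # One weight matrix per group: (groups, in/g, out/g)
        self.weight = nn.Parameter(
            torch.empty(groups, in_features // groups, out_features // groups)
        )
        nn.init.xavier_uniform_(self.weight)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, in_features)
        b = x.size(0)
        x = x.reshape(b, self.groups, -1)                  # (B, g, in/g)
        x = torch.einsum("bgi,gio->bgo", x, self.weight)   # per-group projection
        return x.reshape(b, -1)                            # (B, out_features)

# Quick shape check
y = GroupLinear(128, 256, groups=4)(torch.randn(2, 128))
print(y.shape)  # torch.Size([2, 256])
```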
Open Source Code | Yes | Our source code is open-source and is available at: https://github.com/sacmehta/delight
Open Datasets | Yes | We benchmark DeLighT models on four datasets: (1) IWSLT'14 German-English (De-En), (2) WMT'16 English-Romanian (En-Ro), (3) WMT'14 English-German (WMT'14 En-De), and (4) WMT'14 English-French (WMT'14 En-Fr)... We evaluate on the WikiText-103 dataset (Merity et al., 2017) that has 103M/217K/245K tokens for training, validation, and testing.
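The paper preprocesses these corpora with Fairseq's provided scripts; purely as a hedged illustration of how one might fetch and sanity-check the language modeling corpus, the sketch below assumes the Hugging Face datasets hub mirror of WikiText-103 (the "wikitext" / "wikitext-103-raw-v1" names are assumptions about that mirror, not part of the authors' pipeline).

```python
# Sketch: loading WikiText-103 via the Hugging Face `datasets` hub and
# roughly verifying the split sizes quoted above. This is NOT the authors'
# preprocessing pipeline (they use Fairseq scripts); it is only one way to
# obtain the raw corpus for inspection.
from datasets import load_dataset

wikitext = load_dataset("wikitext", "wikitext-103-raw-v1")
for split in ("train", "validation", "test"):
    # Counting the training split takes a little while.
    n_tokens = sum(len(line.split()) for line in wikitext[split]["text"])
    print(split, f"{n_tokens / 1e6:.1f}M whitespace tokens")
# Expect roughly 103M / 0.2M / 0.2M tokens, in line with the splits cited above.
```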
Dataset Splits | Yes | For the IWSLT'14 De-En dataset, we replicate the setup of Wu et al. (2019) and Edunov et al. (2018), which uses 160K/7K/7K sentence pairs for training, validation, and testing, respectively, with a joint BPE vocabulary of about 10K tokens.
Hardware Specification | Yes | We train all our models for 50K iterations with a batch size of 4K tokens on a single NVIDIA GTX 1080 GPU.
Software Dependencies | No | We implement our models using Fairseq (Ott et al., 2019) and use their provided scripts for data pre-processing, training, and evaluation. We also enabled the dedicated CUDA kernel provided by the APEX library for multi-head attention in Transformers. No specific version numbers are provided for these software components.
Experiment Setup | Yes | For IWSLT'14 De-En models, we follow the setup of Wu et al. (2019) and train all our models for 50K iterations with a batch size of 4K tokens on a single NVIDIA GTX 1080 GPU... We use Adam (Kingma and Ba, 2015) to minimize cross-entropy loss with a label smoothing value of 0.1 during training.
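The optimization recipe quoted above (Adam minimizing cross-entropy with label smoothing of 0.1) is a standard one. Below is a minimal PyTorch sketch of that loss/optimizer combination; the toy model, vocabulary size, learning rate, and batch are placeholders, and the snippet stands in for Fairseq's label-smoothed cross-entropy criterion rather than reproducing the authors' configuration.

```python
import torch
import torch.nn as nn

# Minimal sketch of the training objective described above: Adam minimizing
# cross-entropy with label smoothing 0.1. The model, vocabulary size, and
# toy batch are placeholders, not the authors' Fairseq setup.
vocab_size, pad_idx = 10_000, 1
model = nn.Sequential(nn.Embedding(vocab_size, 512), nn.Linear(512, vocab_size))

criterion = nn.CrossEntropyLoss(label_smoothing=0.1, ignore_index=pad_idx)
optimizer = torch.optim.Adam(model.parameters(), lr=5e-4, betas=(0.9, 0.98))

tokens = torch.randint(2, vocab_size, (4, 32))    # toy (batch, seq) input
targets = torch.randint(2, vocab_size, (4, 32))   # toy next-token targets

logits = model(tokens)                            # (batch, seq, vocab)
loss = criterion(logits.reshape(-1, vocab_size), targets.reshape(-1))
loss.backward()
optimizer.step()
```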