DeFINE: Deep Factorized Input Token Embeddings for Neural Sequence Modeling

Authors: Sachin Mehta, Rik Koncel-Kedziorski, Mohammad Rastegari, Hannaneh Hajishirzi

ICLR 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | 4 EXPERIMENTAL RESULTS: We demonstrate the performance of DeFINE on two sequence modeling tasks: language modeling (Section 4.1) and machine translation (Section 4.2). We compare the performance of DeFINE with existing factorization and compression-based methods in Section 4.3. We also provide ablations in Section 4.4 to show the effectiveness of our design decisions.
Researcher Affiliation | Collaboration | Sachin Mehta (1), Rik Koncel-Kedziorski (1), Mohammad Rastegari (1), and Hannaneh Hajishirzi (1, 2); (1) University of Washington, (2) Allen Institute for AI
Pseudocode | No | No explicitly labeled 'Pseudocode' or 'Algorithm' blocks were found.
Open Source Code | No | The paper does not explicitly state that the source code for the described methodology is publicly available, nor does it provide a link.
Open Datasets | Yes | The WikiText-103 dataset (Merity et al., 2017) consists of 103M/217K/245K tokens for training, validation, and test sets respectively and has a vocabulary size of about 260K. This dataset is composed of Wikipedia articles and retains punctuation, numbers, and case. The Penn Treebank dataset (Marcus et al., 1994) contains about 929K/74K/82K tokens in its train, validation, and test sets respectively. It has a vocabulary size of about 10K. We use the WMT 2014 English-German (EN-DE) dataset (Luong et al., 2015) for training.
Dataset Splits | Yes | The WikiText-103 dataset (Merity et al., 2017) consists of 103M/217K/245K tokens for training, validation, and test sets respectively and has a vocabulary size of about 260K. The Penn Treebank dataset (Marcus et al., 1994) contains about 929K/74K/82K tokens in its train, validation, and test sets respectively. It has a vocabulary size of about 10K. (A split-count sketch for WikiText-103 follows this table.)
Hardware Specification | Yes | For training LSTM-based language models, we use a single NVIDIA GTX 1080 Ti GPU with 11 GB GPU memory, while for training Transformer-XL, we used four GeForce RTX 2080 Ti GPUs, each with 11 GB of GPU memory (as recommended by authors).
Software Dependencies | Yes | We train our models using PyTorch (v1.2).
Experiment Setup | Yes | For LSTM-based language models, we use similar hyper-parameters as Merity et al. (2018a), which are summarized in Section 9. Table 9 (hyper-parameters for training the word-level LSTM-based language model on WikiText-103; settings similar to Merity et al. (2018a)) lists: # of GPUs: 1; Weight decay: 0; Optimizer: SGD; LR: 20; BPTT length: 140; Batch size: 60; Epochs: 20; LR reduction (factor, steps): 10, [15]; LSTM hidden dimension: 1024; # of LSTM layers: 4; Max. dimension of ê_i^k: 1024; Dropout: same as Merity et al. (2018a).
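
To make the Experiment Setup row easier to act on, here is a minimal PyTorch sketch that encodes the Table 9 hyper-parameters as a configuration and builds the corresponding SGD optimizer with a step-wise learning-rate schedule. The `TRAIN_CONFIG` dictionary and `build_optimizer` helper are illustrative names, not code from the paper, and dropout values are omitted because the table defers to Merity et al. (2018a).

```python
import torch

# Hyper-parameters quoted from Table 9 (word-level LSTM LM on WikiText-103).
TRAIN_CONFIG = {
    "num_gpus": 1,
    "weight_decay": 0.0,
    "optimizer": "SGD",
    "lr": 20.0,
    "bptt_length": 140,
    "batch_size": 60,
    "epochs": 20,
    "lr_reduction_factor": 10,
    "lr_reduction_steps": [15],   # epochs at which LR is divided by the factor
    "lstm_hidden_dim": 1024,
    "num_lstm_layers": 4,
    "max_dim_e_hat_k": 1024,      # max. dimension of ê_i^k in the DeFINE mapping
    # Dropout: same as Merity et al. (2018a); values are not restated in Table 9.
}


def build_optimizer(model: torch.nn.Module):
    """Build the SGD optimizer and step-wise LR schedule described in Table 9."""
    optimizer = torch.optim.SGD(
        model.parameters(),
        lr=TRAIN_CONFIG["lr"],
        weight_decay=TRAIN_CONFIG["weight_decay"],
    )
    # Divide the learning rate by the quoted factor (10) at the quoted epoch (15).
    scheduler = torch.optim.lr_scheduler.MultiStepLR(
        optimizer,
        milestones=TRAIN_CONFIG["lr_reduction_steps"],
        gamma=1.0 / TRAIN_CONFIG["lr_reduction_factor"],
    )
    return optimizer, scheduler
```

With these settings the learning rate stays at 20 for the first 15 epochs and drops to 2 for the remaining 5, matching the quoted "10, [15]" reduction.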
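
For the Dataset Splits row, the following standalone sketch (not from the paper) counts tokens per split and the training-set vocabulary in a local copy of WikiText-103, which is one way to check the quoted 103M/217K/245K token counts and ~260K vocabulary. The directory name and the file names wiki.train.tokens, wiki.valid.tokens, and wiki.test.tokens are assumptions based on the standard word-level release.

```python
from pathlib import Path


def wikitext103_stats(data_dir: str = "wikitext-103"):
    """Count tokens per split and the training-set vocabulary size.

    Assumes the standard word-level WikiText-103 release with files
    wiki.train.tokens, wiki.valid.tokens, and wiki.test.tokens in `data_dir`.
    Loads each split fully into memory, so the training split needs a few GB of RAM.
    """
    token_counts = {}
    train_vocab = set()
    for split in ("train", "valid", "test"):
        tokens = (Path(data_dir) / f"wiki.{split}.tokens").read_text(encoding="utf-8").split()
        token_counts[split] = len(tokens)
        if split == "train":
            train_vocab.update(tokens)  # vocabulary is built from the training split
    return token_counts, len(train_vocab)


if __name__ == "__main__":
    counts, vocab_size = wikitext103_stats()
    # Expected, per the paper: roughly 103M / 217K / 245K tokens and a ~260K-word vocabulary.
    print(counts, vocab_size)
```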