DeFINE: Deep Factorized Input Token Embeddings for Neural Sequence Modeling
Authors: Sachin Mehta, Rik Koncel-Kedziorski, Mohammad Rastegari, Hannaneh Hajishirzi
ICLR 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | 4 EXPERIMENTAL RESULTS: We demonstrate the performance of DeFINE on two sequence modeling tasks: language modeling (Section 4.1) and machine translation (Section 4.2). We compare the performance of DeFINE with existing factorization and compression-based methods in Section 4.3. We also provide ablations in Section 4.4 to show the effectiveness of our design decisions. |
| Researcher Affiliation | Collaboration | Sachin Mehta (1), Rik Koncel-Kedziorski (1), Mohammad Rastegari (1), and Hannaneh Hajishirzi (1,2); (1) University of Washington, (2) Allen Institute for AI |
| Pseudocode | No | No explicitly labeled 'Pseudocode' or 'Algorithm' blocks were found. |
| Open Source Code | No | The paper does not explicitly state that the source code for the described methodology is publicly available, nor does it provide a link. |
| Open Datasets | Yes | The WikiText-103 dataset (Merity et al., 2017) consists of 103M/217K/245K tokens for training, validation, and test sets respectively and has a vocabulary size of about 260K. This dataset is composed of Wikipedia articles and retains punctuation, numbers, and case. The Penn Treebank dataset (Marcus et al., 1994) contains about 929K/74K/82K tokens in its train, validation, and test sets respectively. It has a vocabulary size of about 10K. We use the WMT 2014 English-German (EN-DE) dataset (Luong et al., 2015) for training. |
| Dataset Splits | Yes | The WikiText-103 dataset (Merity et al., 2017) consists of 103M/217K/245K tokens for training, validation, and test sets respectively and has a vocabulary size of about 260K. The Penn Treebank dataset (Marcus et al., 1994) contains about 929K/74K/82K tokens in its train, validation, and test sets respectively. It has a vocabulary size of about 10K. (A hedged split-verification sketch follows the table.) |
| Hardware Specification | Yes | For training LSTM-based language models, we use a single NVIDIA GTX 1080 Ti GPU with 11 GB GPU memory while for training Transformer-XL, we used four GeForce RTX 2080 Ti GPUs, each with 11 GB of GPU memory (as recommended by authors). |
| Software Dependencies | Yes | We train our models using PyTorch (v1.2). |
| Experiment Setup | Yes | For LSTM-based language models, we use similar hyper-parameters as Merity et al. (2018a), which are summarized in Section 9. Table 9: Hyper-parameters for training the word-level LSTM-based language model on WikiText-103 (settings similar to Merity et al. (2018a)): # of GPUs: 1; Weight decay: 0; Optimizer: SGD; LR: 20; BPTT length: 140; Batch size: 60; Epochs: 20; LR reduction (factor, steps): 10, [15]; LSTM hidden dimension: 1024; # of LSTM layers: 4; Max. dimension of ê_i^k: 1024; Dropout: same as Merity et al. (2018a). (A hedged training-configuration sketch follows the table.) |
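The split sizes quoted in the Dataset Splits row can be sanity-checked independently of the paper. The sketch below is a minimal illustration, assuming the Hugging Face `datasets` package and its `wikitext-103-v1` configuration as a convenient source of the train/validation/test splits; the paper does not state how the corpus was obtained, and whitespace tokenization is only an approximation of the counts reported by Merity et al. (2017).

```python
# Minimal sketch (not from the paper): approximate the reported WikiText-103
# split sizes (103M / 217K / 245K tokens) by whitespace-tokenizing each split.
# The "wikitext-103-v1" config of the Hugging Face `datasets` hub is an
# assumption about where to obtain the splits, not the authors' pipeline.
from datasets import load_dataset

wikitext = load_dataset("wikitext", "wikitext-103-v1")

for split in ("train", "validation", "test"):
    n_tokens = sum(len(line.split()) for line in wikitext[split]["text"])
    print(f"{split}: {n_tokens:,} whitespace tokens")
```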
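The Table 9 hyper-parameters in the Experiment Setup row map directly onto a standard PyTorch training configuration. The sketch below is a hedged rendering of those settings only (SGD with LR 20, zero weight decay, batch size 60, BPTT length 140, 20 epochs, LR divided by 10 at epoch 15, a 4-layer LSTM with hidden dimension 1024); the model is a generic LSTM language model in which a plain `nn.Embedding` stands in for the DeFINE factorized embedding, and the vocabulary size, dropout, and data pipeline are placeholders rather than the authors' code.

```python
# Hedged sketch of the Table 9 training configuration for the word-level
# LSTM language model on WikiText-103. A plain nn.Embedding stands in for
# the DeFINE factorized embedding; vocab_size is an approximation.
import torch
import torch.nn as nn

vocab_size = 260_000            # approximate WikiText-103 vocabulary size
hidden_dim, num_layers = 1024, 4        # Table 9: LSTM hidden dim and layers
bptt, batch_size, epochs = 140, 60, 20  # Table 9: BPTT length, batch, epochs


class LSTMLanguageModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_dim)  # DeFINE block would replace this
        self.lstm = nn.LSTM(hidden_dim, hidden_dim, num_layers, batch_first=True)
        self.decoder = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tokens, state=None):
        out, state = self.lstm(self.embed(tokens), state)
        return self.decoder(out), state


model = LSTMLanguageModel()
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=20.0, weight_decay=0.0)
# "LR reduction (factor, steps) 10, [15]" -> divide the LR by 10 at epoch 15
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[15], gamma=0.1)

for epoch in range(epochs):
    # ... train for one epoch over BPTT-length chunks of `batch_size` sequences
    #     (data pipeline and forward/backward pass omitted in this sketch) ...
    scheduler.step()
```

The `MultiStepLR` scheduler with `milestones=[15]` and `gamma=0.1` is one straightforward way to realize the "LR reduction (factor, steps): 10, [15]" entry; the remaining rows of Table 9 appear verbatim as the constants above.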