DictFormer: Tiny Transformer with Shared Dictionary
Authors: Qian Lou, Ting Hua, Yen-Chang Hsu, Yilin Shen, Hongxia Jin
ICLR 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments show that DictFormer reduces model size by 3.6× to 8.9× with similar accuracy over multiple tasks, compared to Transformer. |
| Researcher Affiliation | Industry | Qian Lou, Ting Hua, Yen-Chang Hsu, Yilin Shen, Hongxia Jin; Samsung Research America; {qian.lou, ting.hua, yenchang.hsu, yilin.shen, hongxia.jin}@samsung.com |
| Pseudocode | No | The paper contains figures and mathematical equations but no explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | DictFormer code is available at https://github.com/SamNLP/DictFormer. |
| Open Datasets | Yes | Three machine translation benchmarks are tested: IWSLT 14 German-English (De-En), WMT 14 English to German (En-De), and WMT 14 English to French (En-Fr). For IWSLT 14 De-En, we adopt the same settings as in Wu et al. (2020)... We evaluate DictFormer on the CNN-Daily Mail dataset (Chen et al., 2016)... We also evaluate DictFormer on WIKITEXT-103 (Merity et al., 2016)... |
| Dataset Splits | Yes | For WMT 14 En-De, our models are trained with 4.5M sentence pairs, validated on newstest2013, and tested on newstest2014. The last 10 model checkpoints are averaged for testing and the lowest-perplexity model is picked for validation. |
| Hardware Specification | Yes | The training experiments of WMT, summarization, and language modeling are conducted on 8 NVIDIA Tesla V100 GPUs. IWSLT De-En is trained on two GPUs. |
| Software Dependencies | No | The paper mentions 'Fairseq's transformer implementation (Ott et al., 2019) is used as the backbone for the baseline model' but does not specify its version or other software dependencies with version numbers. |
| Experiment Setup | Yes | For machine translation tasks, a dropout of 0.3 is used, and the dropout ratio is linearly scaled down when we shrink the dimension of the embeddings for WMT datasets. The learning rate linearly warms up from 10^-7 to 10^-3, followed by cosine annealing with a single cycle, using the Adam optimizer... Machine translation tasks are trained for 50K steps, while language modeling tasks are trained for 286K steps. |
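
The learning-rate schedule quoted in the experiment-setup row (linear warmup from 10^-7 to 10^-3, then a single cosine-annealing cycle with Adam, over a 50K-step machine-translation budget) can be made concrete with a minimal sketch. The warmup length and the floor learning rate below are illustrative assumptions, not values stated in the paper.

```python
import math

# Sketch of the quoted schedule: linear warmup from 1e-7 to 1e-3, then a single
# cosine-annealing cycle over the remaining steps. The warmup length (4,000 steps)
# and the final learning rate (1e-7) are assumptions for illustration only.

def dictformer_style_lr(step, total_steps=50_000, warmup_steps=4_000,
                        lr_start=1e-7, lr_peak=1e-3, lr_end=1e-7):
    """Return the learning rate for a given (1-indexed) training step."""
    if step <= warmup_steps:
        # Linear warmup from lr_start to lr_peak.
        frac = step / warmup_steps
        return lr_start + frac * (lr_peak - lr_start)
    # Single-cycle cosine annealing from lr_peak down to lr_end.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return lr_end + 0.5 * (lr_peak - lr_end) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    for s in (1, 4_000, 25_000, 50_000):
        print(f"step {s:>6}: lr = {dictformer_style_lr(s):.2e}")
```
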
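The dataset-splits row also notes that the last 10 model checkpoints are averaged for testing. Below is a minimal sketch of that averaging step, assuming fairseq-style checkpoint files that store the weights under a "model" key; the file paths are hypothetical.

```python
import torch

# Sketch of averaging the last N saved checkpoints before testing (N = 10 in the
# quoted WMT En-De setup). The checkpoint paths and the "model" key layout are
# assumptions based on fairseq conventions, not details taken from the paper.

def average_checkpoints(paths):
    """Average the parameter tensors of several checkpoints element-wise."""
    avg_state = None
    for path in paths:
        state = torch.load(path, map_location="cpu")["model"]
        if avg_state is None:
            avg_state = {k: v.clone().float() for k, v in state.items()}
        else:
            for k, v in state.items():
                avg_state[k] += v.float()
    return {k: v / len(paths) for k, v in avg_state.items()}


if __name__ == "__main__":
    # Hypothetical paths: the ten most recent checkpoints of a 50-epoch run.
    last_ten = [f"checkpoints/checkpoint{i}.pt" for i in range(41, 51)]
    averaged = average_checkpoints(last_ten)
    torch.save({"model": averaged}, "checkpoints/checkpoint_avg_last10.pt")
```

Averaging the final checkpoints is a standard NMT evaluation trick: it smooths step-to-step noise in the weights and typically yields a small, consistent BLEU gain over using the last checkpoint alone.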