DictFormer: Tiny Transformer with Shared Dictionary
Authors: Qian Lou, Ting Hua, Yen-Chang Hsu, Yilin Shen, Hongxia Jin
ICLR 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments show that DictFormer reduces model size by 3.6× to 8.9× with similar accuracy over multiple tasks, compared to Transformer. |
| Researcher Affiliation | Industry | Qian Lou, Ting Hua, Yen-Chang Hsu, Yilin Shen, Hongxia Jin; Samsung Research America; {qian.lou, ting.hua, yenchang.hsu, yilin.shen, hongxia.jin}@samsung.com |
| Pseudocode | No | The paper contains figures and mathematical equations but no explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | DictFormer code is available at https://github.com/SamNLP/DictFormer. |
| Open Datasets | Yes | Three machine translation benchmarks are tested: IWSLT 14 German-English (De-En), WMT 14 English to German (En-De), and WMT 14 English to French (En-Fr). For IWSLT 14 De-En, we adopt the same settings as in Wu et al. (2020)... We evaluate DictFormer on the CNN-Daily Mail dataset (Chen et al., 2016)... We also evaluate DictFormer on WIKITEXT-103 (Merity et al., 2016)... |
| Dataset Splits | Yes | For WMT 14 En-De, our models are trained with 4.5M sentence pairs, validated on newstest2013, and tested on newstest2014. The last 10 model checkpoints are averaged for testing and the lowest-perplexity model is picked for validation. |
| Hardware Specification | Yes | The training experiments of WMT, summarization, and language modeling are conducted on 8 NVIDIA Tesla V100 GPUs. IWSLT De-En is trained on two GPUs. |
| Software Dependencies | No | The paper mentions 'Fairseq's transformer implementation (Ott et al., 2019) is used as the backbone for the baseline model' but does not specify its version or other software dependencies with version numbers. |
| Experiment Setup | Yes | For machine translation tasks, a dropout of 0.3 is used, and the dropout ratio is linearly scaled down when we shrink the dimension of the embeddings for WMT datasets. The learning rate linearly warms up from 10^-7 to 10^-3, followed by cosine annealing with a single cycle, using the Adam optimizer... Machine translation tasks are trained for 50K steps, while language modeling tasks are trained for 286K steps. |
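
The learning-rate schedule quoted in the experiment-setup row (linear warmup from 10^-7 to 10^-3, then a single cosine-annealing cycle with Adam, over a 50K-step machine-translation budget) can be made concrete with a minimal sketch. The warmup length and the floor learning rate below are illustrative assumptions, not values stated in the paper.

```python
import math

# Sketch of the quoted schedule: linear warmup from 1e-7 to 1e-3, then a single
# cosine-annealing cycle over the remaining steps. The warmup length (4,000 steps)
# and the final learning rate (1e-7) are assumptions for illustration only.

def dictformer_style_lr(step, total_steps=50_000, warmup_steps=4_000,
                        lr_start=1e-7, lr_peak=1e-3, lr_end=1e-7):
    """Return the learning rate for a given (1-indexed) training step."""
    if step <= warmup_steps:
        # Linear warmup from lr_start to lr_peak.
        frac = step / warmup_steps
        return lr_start + frac * (lr_peak - lr_start)
    # Single-cycle cosine annealing from lr_peak down to lr_end.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return lr_end + 0.5 * (lr_peak - lr_end) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    for s in (1, 4_000, 25_000, 50_000):
        print(f"step {s:>6}: lr = {dictformer_style_lr(s):.2e}")
```
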
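The dataset-splits row also notes that the last 10 model checkpoints are averaged for testing. Below is a minimal sketch of that averaging step, assuming fairseq-style checkpoint files that store the weights under a "model" key; the file paths are hypothetical.

```python
import torch

# Sketch of averaging the last N saved checkpoints before testing (N = 10 in the
# quoted WMT En-De setup). The checkpoint paths and the "model" key layout are
# assumptions based on fairseq conventions, not details taken from the paper.

def average_checkpoints(paths):
    """Average the parameter tensors of several checkpoints element-wise."""
    avg_state = None
    for path in paths:
        state = torch.load(path, map_location="cpu")["model"]
        if avg_state is None:
            avg_state = {k: v.clone().float() for k, v in state.items()}
        else:
            for k, v in state.items():
                avg_state[k] += v.float()
    return {k: v / len(paths) for k, v in avg_state.items()}


if __name__ == "__main__":
    # Hypothetical paths: the ten most recent checkpoints of a 50-epoch run.
    last_ten = [f"checkpoints/checkpoint{i}.pt" for i in range(41, 51)]
    averaged = average_checkpoints(last_ten)
    torch.save({"model": averaged}, "checkpoints/checkpoint_avg_last10.pt")
```

Averaging the final checkpoints is a standard NMT evaluation trick: it smooths step-to-step noise in the weights and typically yields a small, consistent BLEU gain over using the last checkpoint alone.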