IOT: Instance-wise Layer Reordering for Transformer Structures

Authors: Jinhua Zhu, Lijun Wu, Yingce Xia, Shufang Xie, Tao Qin, Wengang Zhou, Houqiang Li, Tie-Yan Liu

ICLR 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments on 3 tasks (neural machine translation, abstractive summarization, and code generation) and 9 datasets demonstrate consistent improvements of our method.
Researcher Affiliation | Collaboration | Jinhua Zhu1, Lijun Wu2, Yingce Xia2, Shufang Xie2, Tao Qin2, Wengang Zhou1, Houqiang Li1, Tie-Yan Liu2; 1University of Science and Technology of China; 2Microsoft Research
Pseudocode | No | The paper describes algorithms and processes using mathematical formulas and descriptive text, but it does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks.
Open Source Code | Yes | Our code is released at Github: https://github.com/instance-wise-ordered-transformer/IOT
Open Datasets | Yes | For the low-resource scenario, we conduct experiments on IWSLT14 English-German (En-De), English-Spanish (En-Es), IWSLT17 English-French (En-Fr), English-Chinese (En-Zh) translations... For the rich-resource scenario, we work on WMT14 En-De and WMT16 Romanian-English (Ro-En) translations... We work on one Java (Hu et al., 2018) and one Python dataset (Wan et al., 2018)... The dataset we utilized is a widely acknowledged one: Gigaword summarization, which is constructed from a subset of the Gigaword corpus (Graff et al., 2003) and first used by Rush et al. (2017).
Dataset Splits | Yes | For WMT14 En-De, we filter out 4.5M sentence pairs for training and concatenate newstest2012 and newstest2013 as dev set, newstest2014 as test set. For WMT16 Ro-En, we concatenate the 0.6M bilingual pairs and 2.0M back-translated data for training; newsdev2016/newstest2016 serve as dev/test set. ... We split each dataset with ratio 0.8 : 0.1 : 0.1 as training, dev and test set. ... The training data consists of 3.8M article-headline pairs, while the dev and test set consist of 190k and 2k pairs respectively.
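As an illustration of the 0.8 : 0.1 : 0.1 split quoted above, the sketch below shows one way such a split could be produced. It is a minimal assumption-laden example, not the authors' released preprocessing code; the function name, random seed, and (source, target) pair format are illustrative choices.

    import random

    def split_dataset(pairs, train_ratio=0.8, dev_ratio=0.1, seed=1):
        """Shuffle (source, target) pairs and cut them into train/dev/test
        at the quoted 0.8 : 0.1 : 0.1 ratio."""
        pairs = list(pairs)
        random.Random(seed).shuffle(pairs)
        n_train = int(len(pairs) * train_ratio)
        n_dev = int(len(pairs) * dev_ratio)
        train = pairs[:n_train]
        dev = pairs[n_train:n_train + n_dev]
        test = pairs[n_train + n_dev:]
        return train, dev, test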
Hardware Specification | Yes | The study is performed on a single Tesla P100 GPU card.
Software Dependencies | No | The paper mentions 'Implementation is developed on Fairseq (Ott et al., 2019)' and refers to the 'Adam (Kingma & Ba, 2014) optimizer', but it does not specify version numbers for Fairseq, Adam, or any other software libraries, programming languages, or environments.
Experiment Setup | Yes | For IWSLT translation tasks, we use the transformer_iwslt_de_en setting as model configuration. The number of blocks, embedding size and feed-forward network (FFN) size are 6, 512, 1024. WMT tasks use the transformer_vaswani_wmt_en_de_big configuration, with 6 blocks, embedding size 1024 and FFN size 4096. Optimization and learning rate scheduler are the default settings in Vaswani et al. (2017). For code generation, block number/embedding size/FFN size are 3, 256, 1024 respectively. ... Dropout (Srivastava et al., 2014) is set to be 0.3. Other settings are also the same as the NMT task. ... We first grid search c1, c2 on the IWSLT14 De-En dev set, and then apply them to other tasks. The best setting is c1 = 0.1, c2 = 0.01... We adopt the default optimization setting in Vaswani et al. (2017): Adam (Kingma & Ba, 2014) optimizer with β1 = 0.9, β2 = 0.98 and ϵ = 10^-9. The learning rate scheduler is inverse sqrt with 4,000 warmup steps; the default learning rate is 0.0005. Label smoothing (Szegedy et al., 2016) is used with value 0.1. As introduced, to learn the predictors, we clamp the softmax output with value 0.05.
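For concreteness, here is a minimal sketch of the quoted optimizer and "inverse sqrt" learning-rate schedule (4,000 warmup steps, peak learning rate 0.0005, Adam with β1 = 0.9, β2 = 0.98, ϵ = 10^-9). This is an illustrative PyTorch re-implementation, not the authors' Fairseq code; the function name and the placeholder module are assumptions.

    import torch

    def inverse_sqrt_lr(step, warmup_steps=4000, peak_lr=5e-4):
        """Linear warmup to peak_lr, then decay proportional to 1/sqrt(step)."""
        if step < warmup_steps:
            return peak_lr * step / warmup_steps
        return peak_lr * (warmup_steps ** 0.5) / (step ** 0.5)

    # Adam with the hyperparameters quoted above; the model is a placeholder.
    model = torch.nn.Linear(512, 512)
    optimizer = torch.optim.Adam(model.parameters(), lr=5e-4,
                                 betas=(0.9, 0.98), eps=1e-9)

    for step in range(1, 8001):
        for group in optimizer.param_groups:
            group["lr"] = inverse_sqrt_lr(step)
        # forward pass, loss.backward(), and optimizer.step() would go here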