IOT: Instance-wise Layer Reordering for Transformer Structures

Authors: Jinhua Zhu, Lijun Wu, Yingce Xia, Shufang Xie, Tao Qin, Wengang Zhou, Houqiang Li, Tie-Yan Liu

ICLR 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments on 3 tasks (neural machine translation, abstractive summarization, and code generation) and 9 datasets demonstrate consistent improvements of our method.
Researcher Affiliation | Collaboration | Jinhua Zhu1, Lijun Wu2, Yingce Xia2, Shufang Xie2, Tao Qin2, Wengang Zhou1, Houqiang Li1, Tie-Yan Liu2; 1University of Science and Technology of China; 2Microsoft Research
Pseudocode | No | The paper describes algorithms and processes using mathematical formulas and descriptive text, but it does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks.
Open Source Code | Yes | Our code is released at Github: https://github.com/instance-wise-ordered-transformer/IOT
Open Datasets | Yes | For the low-resource scenario, we conduct experiments on IWSLT14 English-German (En-De), English-Spanish (En-Es), IWSLT17 English-French (En-Fr), English-Chinese (En-Zh) translations... For the rich-resource scenario, we work on WMT14 En-De and WMT16 Romanian-English (Ro-En) translations... We work on one Java (Hu et al., 2018) and one Python dataset (Wan et al., 2018)... The dataset we utilized is a widely acknowledged one: Gigaword summarization, which is constructed from a subset of the Gigaword corpus (Graff et al., 2003) and first used by Rush et al. (2017).
Dataset Splits | Yes | For WMT14 En-De, we filter out 4.5M sentence pairs for training and concatenate newstest2012 and newstest2013 as dev set, newstest2014 as test set. For WMT16 Ro-En, we concatenate the 0.6M bilingual pairs and 2.0M back-translated data for training; newsdev2016/newstest2016 serve as dev/test set. ... We split each dataset with ratio 0.8 : 0.1 : 0.1 as training, dev and test set. ... The training data consists of 3.8M article-headline pairs, while the dev and test set consist of 190k and 2k pairs respectively.
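As an illustration of the 0.8 : 0.1 : 0.1 split quoted above, the sketch below shows one way such a split could be produced. It is a minimal assumption-laden example, not the authors' released preprocessing code; the function name, random seed, and (source, target) pair format are illustrative choices.

    import random

    def split_dataset(pairs, train_ratio=0.8, dev_ratio=0.1, seed=1):
        """Shuffle (source, target) pairs and cut them into train/dev/test
        at the quoted 0.8 : 0.1 : 0.1 ratio."""
        pairs = list(pairs)
        random.Random(seed).shuffle(pairs)
        n_train = int(len(pairs) * train_ratio)
        n_dev = int(len(pairs) * dev_ratio)
        train = pairs[:n_train]
        dev = pairs[n_train:n_train + n_dev]
        test = pairs[n_train + n_dev:]
        return train, dev, test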
Hardware Specification | Yes | The study is performed on a single Tesla P100 GPU card.
Software Dependencies | No | The paper mentions 'Implementation is developed on Fairseq (Ott et al., 2019)' and refers to the 'Adam (Kingma & Ba, 2014) optimizer', but it does not specify version numbers for Fairseq, Adam, or any other software libraries, programming languages, or environments.
Experiment Setup | Yes | For IWSLT translation tasks, we use the transformer_iwslt_de_en setting as model configuration. The number of blocks, embedding size and feed-forward network (FFN) size are 6, 512, 1024. WMT tasks use the transformer_vaswani_wmt_en_de_big configuration, with 6 blocks, embedding size 1024 and FFN size 4096. Optimization and learning rate scheduler are the default settings in Vaswani et al. (2017). For code generation, block number/embedding size/FFN size are 3, 256, 1024 respectively. ... Dropout (Srivastava et al., 2014) is set to be 0.3. Other settings are also the same as the NMT task. ... We first grid search c1, c2 on the IWSLT14 De-En dev set, and then apply them to other tasks. The best setting is c1 = 0.1, c2 = 0.01... We adopt the default optimization setting in Vaswani et al. (2017): Adam (Kingma & Ba, 2014) optimizer with β1 = 0.9, β2 = 0.98 and ϵ = 10^-9. The learning rate scheduler is inverse sqrt with 4,000 warmup steps; the default learning rate is 0.0005. Label smoothing (Szegedy et al., 2016) is used with value 0.1. As introduced, to learn the predictors, we clamp the softmax output with value 0.05.
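For concreteness, here is a minimal sketch of the quoted optimizer and "inverse sqrt" learning-rate schedule (4,000 warmup steps, peak learning rate 0.0005, Adam with β1 = 0.9, β2 = 0.98, ϵ = 10^-9). This is an illustrative PyTorch re-implementation, not the authors' Fairseq code; the function name and the placeholder module are assumptions.

    import torch

    def inverse_sqrt_lr(step, warmup_steps=4000, peak_lr=5e-4):
        """Linear warmup to peak_lr, then decay proportional to 1/sqrt(step)."""
        if step < warmup_steps:
            return peak_lr * step / warmup_steps
        return peak_lr * (warmup_steps ** 0.5) / (step ** 0.5)

    # Adam with the hyperparameters quoted above; the model is a placeholder.
    model = torch.nn.Linear(512, 512)
    optimizer = torch.optim.Adam(model.parameters(), lr=5e-4,
                                 betas=(0.9, 0.98), eps=1e-9)

    for step in range(1, 8001):
        for group in optimizer.param_groups:
            group["lr"] = inverse_sqrt_lr(step)
        # forward pass, loss.backward(), and optimizer.step() would go here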