IOT: Instance-wise Layer Reordering for Transformer Structures
Authors: Jinhua Zhu, Lijun Wu, Yingce Xia, Shufang Xie, Tao Qin, Wengang Zhou, Houqiang Li, Tie-Yan Liu
ICLR 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on 3 tasks (neural machine translation, abstractive summarization, and code generation) and 9 datasets demonstrate consistent improvements of our method. |
| Researcher Affiliation | Collaboration | Jinhua Zhu1, Lijun Wu2, Yingce Xia2, Shufang Xie2, Tao Qin2, Wengang Zhou1, Houqiang Li1, Tie-Yan Liu2. 1University of Science and Technology of China; 2Microsoft Research |
| Pseudocode | No | The paper describes algorithms and processes using mathematical formulas and descriptive text, but it does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks. |
| Open Source Code | Yes | Our code is released at Github1. 1https://github.com/instance-wise-ordered-transformer/IOT |
| Open Datasets | Yes | For the low-resource scenario, we conduct experiments on IWSLT14 English↔German (En↔De), English↔Spanish (En↔Es), IWSLT17 English↔French (En↔Fr), English↔Chinese (En↔Zh) translations... For the rich-resource scenario, we work on WMT14 En→De and WMT16 Romanian→English (Ro→En) translations... We work on one Java (Hu et al., 2018) and one Python dataset (Wan et al., 2018)... The dataset we utilized is a widely acknowledged one: Gigaword summarization, which is constructed from a subset of the Gigaword corpus (Graff et al., 2003) and first used by Rush et al. (2017). |
| Dataset Splits | Yes | For WMT14 En→De, we filter out 4.5M sentence pairs for training and concatenate newstest2012 and newstest2013 as the dev set, with newstest2014 as the test set. For WMT16 Ro→En, we concatenate the 0.6M bilingual pairs and 2.0M back-translated data for training; newsdev2016/newstest2016 serve as dev/test set. ... We split each dataset with ratio 0.8 : 0.1 : 0.1 as training, dev and test set (a minimal split sketch is given after the table). ... The training data consists of 3.8M article-headline pairs, while the dev and test sets consist of 190k and 2k pairs respectively. |
| Hardware Specification | Yes | The study is performed on a single Tesla P100 GPU card. |
| Software Dependencies | No | The paper mentions 'Implementation is developed on Fairseq (Ott et al., 2019)' and refers to 'Adam (Kingma & Ba, 2014) optimizer', but it does not specify version numbers for Fairseq, Adam, or any other software libraries, programming languages, or environments. |
| Experiment Setup | Yes | For IWSLT translation tasks, we use the transformer_iwslt_de_en setting as the model configuration. The number of blocks, embedding size and feed-forward network (FFN) size are 6, 512 and 1024. WMT tasks use the transformer_vaswani_wmt_en_de_big configuration, with 6 blocks, embedding size 1024 and FFN size 4096. Optimization and the learning-rate scheduler are the default settings in Vaswani et al. (2017). For code generation, the block number/embedding size/FFN size are 3, 256 and 1024 respectively. ... Dropout (Srivastava et al., 2014) is set to 0.3. Other settings are the same as for the NMT task. ... We first grid search c1, c2 on the IWSLT14 De→En dev set, and then apply them to the other tasks. The best setting is c1 = 0.1, c2 = 0.01... We adopt the default optimization setting in Vaswani et al. (2017): the Adam (Kingma & Ba, 2014) optimizer with β1 = 0.9, β2 = 0.98 and ϵ = 10⁻⁹. The learning-rate scheduler is inverse_sqrt with 4,000 warmup steps; the default learning rate is 0.0005. Label smoothing (Szegedy et al., 2016) is used with value 0.1. As introduced, to learn the predictors, we clamp the softmax output with value 0.05. (A hedged sketch of this optimization recipe follows the table.) |
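
The 0.8 : 0.1 : 0.1 training/dev/test split reported for the code-generation datasets can be reproduced with a short script. The sketch below is an assumption about how such a split might be performed, not the authors' preprocessing code (the released repository is authoritative); `split_dataset`, `examples`, and the fixed seed are hypothetical names introduced here.

```python
# A minimal sketch (not from the paper's repo) of a 0.8 : 0.1 : 0.1 split.
# `examples` is a hypothetical list of (source, target) pairs already loaded from disk.
import random

def split_dataset(examples, seed=0, ratios=(0.8, 0.1, 0.1)):
    """Shuffle once with a fixed seed, then slice into train/dev/test by `ratios`."""
    examples = list(examples)
    random.Random(seed).shuffle(examples)
    n = len(examples)
    n_train = int(ratios[0] * n)
    n_dev = int(ratios[1] * n)
    train = examples[:n_train]
    dev = examples[n_train:n_train + n_dev]
    test = examples[n_train + n_dev:]
    return train, dev, test

# Usage: train, dev, test = split_dataset(examples)
```

Seeding the shuffle keeps the split stable across runs, which matters when dev-set grid searches (such as the one over c1, c2) are reused later.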
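
The optimization recipe quoted under "Experiment Setup" (Adam with β1 = 0.9, β2 = 0.98, ϵ = 10⁻⁹; an inverse_sqrt learning-rate schedule with 4,000 warmup steps and base rate 0.0005; label smoothing 0.1) maps onto standard PyTorch components. The sketch below assumes the IWSLT configuration (6 blocks, embedding size 512, FFN size 1024) and uses torch.nn.Transformer as a stand-in for the paper's Fairseq-based IOT model; the head count and the exact warmup formula are approximations, not taken from the paper.

```python
import torch
import torch.nn as nn

# Stand-in model matching the reported IWSLT dimensions (not the authors' IOT model).
model = nn.Transformer(
    d_model=512,            # embedding size (IWSLT setting)
    nhead=4,                # assumed head count; not stated in the table above
    num_encoder_layers=6,   # 6 blocks
    num_decoder_layers=6,
    dim_feedforward=1024,   # FFN size (IWSLT setting)
    dropout=0.3,            # dropout value quoted in the setup
)

# Adam with the quoted betas and epsilon, base learning rate 0.0005.
optimizer = torch.optim.Adam(model.parameters(), lr=5e-4, betas=(0.9, 0.98), eps=1e-9)

def inverse_sqrt_factor(step: int, warmup: int = 4000) -> float:
    """Fairseq-style inverse_sqrt schedule: linear warmup, then decay ~ step^-0.5."""
    step = max(step, 1)
    if step < warmup:
        return step / warmup
    return (warmup / step) ** 0.5

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=inverse_sqrt_factor)

# Label-smoothed cross entropy with smoothing value 0.1 (requires PyTorch >= 1.10).
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)
```

Calling scheduler.step() after each update advances the warmup/decay factor; in Fairseq the corresponding behaviour comes from --lr-scheduler inverse_sqrt --warmup-updates 4000.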