Understanding and Improving Encoder Layer Fusion in Sequence-to-Sequence Learning

Authors: Xuebo Liu, Longyue Wang, Derek F. Wong, Liang Ding, Lidia S. Chao, Zhaopeng Tu

ICLR 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results show that SurfaceFusion outperforms EncoderFusion on several NLP benchmarks, including machine translation, text summarization, and grammatical error correction. It obtains state-of-the-art performance on the WMT16 Romanian-English and WMT14 English-French translation tasks. Extensive analyses reveal that SurfaceFusion learns more expressive bilingual word embeddings by building a closer relationship between relevant source and target embeddings. Source code is freely available at https://github.com/SunbowLiu/SurfaceFusion.
Researcher Affiliation | Collaboration | Xuebo Liu¹, Longyue Wang², Derek F. Wong¹, Liang Ding³, Lidia S. Chao¹ & Zhaopeng Tu²; ¹NLP2CT Lab, Department of Computer and Information Science, University of Macau; ²Tencent AI Lab; ³The University of Sydney
Pseudocode | No | The paper does not contain any sections or figures explicitly labeled 'Pseudocode' or 'Algorithm', nor does it present structured steps in a code-like format.
Open Source Code | Yes | Source code is freely available at https://github.com/SunbowLiu/SurfaceFusion.
Open Datasets | Yes | We conducted experiments on three benchmarking datasets: small-scale WMT16 Romanian-English (Ro-En; 0.6M instances), medium-scale WMT14 English-German (En-De; 4.5M instances), and large-scale WMT14 English-French (En-Fr; 36.0M instances). ... We used the CNN/Daily Mail corpus (0.3M instances). ... We used the CONLL14 dataset as the testbed (1.4M instances). For WMT16 Romanian-English, we used the preprocessed data and existing result from Ghazvininejad et al. (2019). ... For WMT14 English-German, the preprocessed data and existing result are derived from Ott et al. (2018). ... For WMT14 English-French, we reported the existing result from Ott et al. (2018) and followed them to preprocess the datasets. ... For the CNN/Daily Mail dataset, we used the existing result and preprocessing method of Ott et al. (2019). ... For the CONLL14 benchmark, the preprocessing script and existing result are given by Chollampatt & Ng (2018). The paper's footnotes link to these preprocessed data sources.
Dataset Splits | Yes | Table 4 reports the statistics: Ro-En (training 0.6M / dev 2K / testing 2K), En-De (training 4.5M / dev 3K / testing 3K), En-Fr (training 35.5M / dev 3K / testing 6K), CNN/DM (training 0.3M / dev 11K / testing 13K), CONLL (training 1.3M / dev 1K / testing 5K). The paper states: 'We chose the checkpoint with best validation ppl for testing' (a selection sketch follows the table).
Hardware Specification | No | The paper does not provide any specific details about the hardware used for running the experiments (e.g., GPU models, CPU types, memory). It only states that 'All the models are implemented by the open-source toolkit fairseq'.
Software Dependencies | No | The paper states that 'All the models are implemented by the open-source toolkit fairseq (Ott et al., 2019)' and provides a link to its GitHub repository. However, it does not pin version numbers for fairseq or for other critical software components such as Python or PyTorch.
Experiment Setup | Yes | Table 4 lists the dataset statistics and hyperparameters. 'All the data have been tokenized and split into joint sub-word units (Sennrich et al., 2016)' (a preprocessing sketch follows the table). Batch denotes the number of source and target tokens used in each training step; DP denotes the dropout value (Srivastava et al., 2014); LP denotes the length penalty (Wu et al., 2016); Base and Big denote the two Transformer model variants. The checkpoint with the best validation ppl was chosen for testing. ... 'It was set to 0.3 for all the experiments.' ... λ is sensitive to the corpus scale but insensitive to the relationship of input/output domain: it was set to 0.9 for the En-De, En-Fr and correction tasks, and 0.8 for the Ro-En and summarization tasks. τ was set to 5 for soft fusion and 1 for hard fusion across the different benchmarks (see the fusion sketch below). All other unmentioned hyperparameters keep the same values as the original Transformer paper (Vaswani et al., 2017).
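For reference, here is a minimal sketch of the joint sub-word preprocessing step quoted above (Sennrich et al., 2016), using the subword-nmt package. The file names and the 32K merge count are illustrative assumptions, not values taken from the paper; its actual BPE settings are listed in its Table 4.

```python
# Learn and apply joint BPE with subword-nmt; paths and merge count are hypothetical.
from subword_nmt.learn_bpe import learn_bpe
from subword_nmt.apply_bpe import BPE

# Learn a joint BPE model over concatenated source+target training text.
with open("train.joint.txt", encoding="utf-8") as infile, \
     open("bpe.codes", "w", encoding="utf-8") as codes_out:
    learn_bpe(infile, codes_out, num_symbols=32000)

# Apply the learned merges to one side of the corpus.
with open("bpe.codes", encoding="utf-8") as codes:
    bpe = BPE(codes)
with open("train.src", encoding="utf-8") as src_in, \
     open("train.bpe.src", "w", encoding="utf-8") as src_out:
    for line in src_in:
        src_out.write(bpe.process_line(line))
```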
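The roles of λ and τ above suggest a probability-level fusion at the softmax. The paper's exact soft/hard fusion equations are not reproduced in this summary, so the following PyTorch sketch shows only a generic convex combination with a temperature, as a hypothetical stand-in for where λ = 0.8/0.9 and τ = 5 (soft) or 1 (hard) would enter; it is not the authors' implementation, and the assumption that λ weights the source-side distribution is mine.

```python
import torch
import torch.nn.functional as F

def fuse_distributions(dec_logits: torch.Tensor,
                       src_logits: torch.Tensor,
                       lam: float = 0.9,
                       tau: float = 5.0) -> torch.Tensor:
    """Fuse a decoder distribution with a source-embedding-derived one.

    dec_logits: decoder output logits, shape (batch, vocab)
    src_logits: logits derived from source embeddings, shape (batch, vocab)
    lam:        fusion weight on the source-side distribution (assumption)
    tau:        softmax temperature applied to the source-side logits
    """
    p_dec = F.softmax(dec_logits, dim=-1)        # standard decoder softmax
    p_src = F.softmax(src_logits / tau, dim=-1)  # tau > 1 flattens, tau < 1 sharpens
    return (1.0 - lam) * p_dec + lam * p_src     # convex combination still sums to 1

# Toy usage: batch of 2, vocabulary of 5.
dec, src = torch.randn(2, 5), torch.randn(2, 5)
p = fuse_distributions(dec, src, lam=0.8, tau=5.0)
assert torch.allclose(p.sum(dim=-1), torch.ones(2))
```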
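Finally, choosing 'the checkpoint with best validation ppl for testing' amounts to taking the minimum over per-checkpoint validation perplexities. A trivial sketch, assuming the perplexities have already been read from the validation logs (the checkpoint names and values below are hypothetical):

```python
def best_checkpoint(val_ppl_by_ckpt: dict) -> str:
    """Return the checkpoint whose validation perplexity is lowest."""
    return min(val_ppl_by_ckpt, key=val_ppl_by_ckpt.get)

# Hypothetical values read from validation logs:
ppl = {"checkpoint1.pt": 5.9, "checkpoint2.pt": 5.4, "checkpoint3.pt": 5.6}
print(best_checkpoint(ppl))  # -> checkpoint2.pt
```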