Sequence Generation with Mixed Representations

Authors: Lijun Wu, Shufang Xie, Yingce Xia, Yang Fan, Jian-Huang Lai, Tao Qin, Tie-Yan Liu

ICML 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate our model and training algorithm in two standard sequence generation tasks, machine translation and abstractive summarization. On six translation language pairs and totally 12 translation tasks, such as English↔German, English↔Dutch and English↔Romanian, our approach outperforms baselines by more than 1.0 BLEU points. For abstractive summarization, experimental results also show consistent improvements compared with only one tokenizer utilized.
Researcher Affiliation | Collaboration | Microsoft Research, Beijing, China; Sun Yat-sen University, Guangzhou, China; University of Science and Technology of China, Hefei, China.
Pseudocode | No | The paper describes the architecture and algorithms using mathematical formulations and diagrams (e.g., Figure 1), but it does not include any section or block explicitly labeled as "Pseudocode" or "Algorithm".
Open Source Code | Yes | Our code is provided in https://github.com/apeterswu/fairseq_mix.
Open Datasets | Yes | We conduct experiments on standard translation tasks with multiple language pairs, which are English↔German (En↔De for short), English↔Dutch (En↔Nl for short), English↔Polish (En↔Pl for short), English↔Portuguese-Brazil (En↔Pt-br for short), English↔Turkish (En↔Tr for short), and English↔Romanian (En↔Ro for short) language pairs. These benchmark datasets all come from the widely acknowledged IWSLT-2014 machine translation (Cettolo et al., 2014) competition. ... We conduct the summarization experiment on a benchmark dataset, the Gigaword summarization dataset. The corpus is constructed from a subset of Gigaword corpus (Graff & Cieri, 2003).
Dataset Splits | Yes | The resulted datasets contains about 160k, 7k and 7k pairs for training, valid and test sets for En↔De task, 180k, 4.7k, 1.1k for En↔Ro task, 170k, 4.5k, 1.1k for En↔Nl task, 175k, 4.5k, 1.2k for En↔Pt-br task, 181k, 4.7k, 1.2k for En↔Pl task and 160k, 4.5k, 1k for En↔Tr task respectively. (A split-count sanity check is sketched after the table.)
Hardware Specification | No | The paper does not specify the exact hardware used for the experiments (e.g., CPU/GPU models, memory). It only states that the implementation is based on the Fairseq toolkit.
Software Dependencies | No | The paper mentions software like 'Moses toolkit', 'Fairseq (Ott et al., 2019) toolkit', and 'Adam (Kingma & Ba, 2014) optimizer', but it does not specify version numbers for these software components, which is required for reproducibility.
Experiment Setup | Yes | The embedding dimension is set as 512 and the size of feed-forward layer is 1024. Each encoder and decoder contain 6 layers for each side. ... The dropout rate (Srivastava et al., 2014) is 0.3 and weight decay is 0.0001 for all experiments. The model is optimized with Adam (Kingma & Ba, 2014) optimizer and the learning rate schedule is the same default setting used as in Vaswani et al. (2017). Label smoothing (Pereyra et al., 2017) is also used with weight 0.1. (A training-command sketch follows the table.)
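For the Dataset Splits row, one way to check a local preprocessing run against the reported approximate sizes is to count sentence pairs per split. The sketch below is not from the paper: the iwslt14_en_de directory and the <split>.<lang> file naming are assumptions about a typical fairseq-style layout, and the expected counts are the approximate En↔De figures quoted above.

```python
# Minimal sanity check (not from the paper) for the approximate IWSLT'14
# En<->De split sizes quoted above: ~160k train, ~7k valid, ~7k test pairs.
from pathlib import Path

EXPECTED = {"train": 160_000, "valid": 7_000, "test": 7_000}
DATA_DIR = Path("iwslt14_en_de")          # hypothetical preprocessed-data directory
TOLERANCE = 0.10                          # allow 10% slack around the reported counts

for split, expected in EXPECTED.items():
    src_file = DATA_DIR / f"{split}.en"   # assumed naming: <split>.<lang>
    n_pairs = sum(1 for _ in src_file.open(encoding="utf-8"))
    ok = abs(n_pairs - expected) <= TOLERANCE * expected
    print(f"{split}: {n_pairs} pairs (expected ~{expected}) -> {'OK' if ok else 'CHECK'}")
```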
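For the Experiment Setup row, the reported hyperparameters line up with fairseq's stock transformer_iwslt_de_en architecture (512-dim embeddings, 1024-dim feed-forward layers, 6 encoder and 6 decoder layers). The following is a minimal sketch of how they could map onto a plain fairseq-train call; the data directory, peak learning rate, warmup steps, and batch size are assumptions not stated in the excerpt, and the authors' fairseq_mix fork may add flags for mixed-representation training that are not shown here.

```python
# Sketch: build and launch a vanilla fairseq-train command that mirrors the
# hyperparameters reported in the Experiment Setup row. Requires fairseq installed.
import shlex
import subprocess

DATA_DIR = "data-bin/iwslt14_en_de"       # hypothetical preprocessed-data directory
SAVE_DIR = "checkpoints/en_de_mix"        # hypothetical checkpoint directory

args = [
    "fairseq-train", DATA_DIR,
    # transformer_iwslt_de_en already defaults to 512-dim embeddings,
    # 1024-dim feed-forward layers, and 6 encoder/decoder layers.
    "--arch", "transformer_iwslt_de_en",
    "--optimizer", "adam",
    "--lr", "5e-4",                       # assumption; the paper only cites the default Vaswani et al. schedule
    "--lr-scheduler", "inverse_sqrt",
    "--warmup-updates", "4000",           # assumption; standard Transformer warmup
    "--dropout", "0.3",
    "--weight-decay", "0.0001",
    "--criterion", "label_smoothed_cross_entropy",
    "--label-smoothing", "0.1",
    "--max-tokens", "4096",               # assumption; batch size is not reported
    "--save-dir", SAVE_DIR,
]

print("Running:", " ".join(shlex.quote(a) for a in args))
subprocess.run(args, check=True)
```

The inverse_sqrt scheduler with linear warmup is fairseq's implementation of the learning-rate schedule from Vaswani et al. (2017), which the paper refers to as the default setting.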