MASS: Masked Sequence to Sequence Pre-training for Language Generation

Authors: Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, Tie-Yan Liu

ICML 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | By further fine-tuning on a variety of zero/low-resource language generation tasks, including neural machine translation, text summarization and conversational response generation (3 tasks and 8 datasets in total), MASS achieves significant improvements over baselines without pre-training or with other pre-training methods. [...] In this section, we describe the experimental details of MASS pre-training and fine-tuning on a variety of language generation tasks, including NMT, text summarization and conversational response generation.
Researcher Affiliation | Collaboration | 1) Key Laboratory of Intelligent Perception and Systems for High-Dimensional Information of Ministry of Education, Nanjing University of Science and Technology; 2) Microsoft Research.
Pseudocode | No | The paper describes the methodology but does not include any structured pseudocode or algorithm blocks (a sketch of the masked sequence-to-sequence objective follows the table).
Open Source Code | Yes | We release the code at https://github.com/microsoft/MASS.
Open Datasets | Yes | We use all of the monolingual data from the WMT News Crawl datasets, which cover 190M, 62M and 270M sentences from 2007 to 2017 for English, French and German respectively. [...] The monolingual data for each language is downloaded from http://www.statmt.org/wmt16/translation-task.html. For the other two tasks, we conduct experiments on: 1) the Gigaword corpus for abstractive text summarization; 2) the Cornell Movie Dialog corpus for conversational response generation.
Dataset Splits | Yes | During evaluation, we calculate the BLEU score with multi-bleu.pl on newstest2014 for English-French, and on newstest2016 for English-German and English-Romanian. We randomly sample 10K/20K pairs as the validation/test set and the remaining data is used for training (an assumed split implementation follows the table).
Hardware Specification | Yes | The model is trained on 8 NVIDIA V100 GPU cards and each mini-batch contains 3000 tokens for pre-training.
Software Dependencies | No | The paper mentions implementing the method on the XLM codebase and using `multi-bleu.pl` for evaluation, but it does not provide version numbers for software dependencies such as Python or PyTorch.
Experiment Setup | Yes | We choose Transformer (Vaswani et al., 2017) as the basic model structure, which consists of a 6-layer encoder and 6-layer decoder with 1024 embedding/hidden size and 4096 feed-forward filter size. ... We use the Adam optimizer (Kingma & Ba, 2015) with a learning rate of 10^-4 for pre-training. The model is trained on 8 NVIDIA V100 GPU cards and each mini-batch contains 3000 tokens for pre-training. (A configuration sketch follows the table.)
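
As noted in the Pseudocode row, the paper contains no algorithm block. The following is a minimal Python sketch of the masked sequence-to-sequence objective described in the paper: a contiguous fragment of the source sentence is replaced by [MASK] tokens on the encoder side, and the decoder is trained to reconstruct that fragment. The helper name `mass_example`, the default mask ratio of 0.5, and the simplified right-shifted decoder input are assumptions for illustration only; the released code at https://github.com/microsoft/MASS implements the full data pipeline.

```python
import random

MASK = "[MASK]"

def mass_example(tokens, mask_ratio=0.5, seed=None):
    """Build one simplified MASS training example from a token list.

    A contiguous fragment covering roughly `mask_ratio` of the sentence is
    masked on the encoder side; the decoder reconstructs that fragment,
    seeing only the fragment's own right-shifted tokens as input.
    """
    rng = random.Random(seed)
    m = len(tokens)
    span_len = max(1, round(m * mask_ratio))
    u = rng.randint(0, m - span_len)            # fragment start (inclusive)
    v = u + span_len                            # fragment end (exclusive)

    fragment = tokens[u:v]
    encoder_input = tokens[:u] + [MASK] * span_len + tokens[v:]
    decoder_input = [MASK] + fragment[:-1]      # right-shifted fragment
    decoder_target = fragment
    return encoder_input, decoder_input, decoder_target

enc_in, dec_in, dec_out = mass_example(
    "the quick brown fox jumps over the lazy dog".split(), seed=0)
print(enc_in)   # source with a contiguous [MASK]-ed fragment
print(dec_in)   # fragment shifted right, first position masked
print(dec_out)  # the fragment the decoder must predict
```

In the authors' implementation the decoder additionally keeps the original positional indices of the fragment and masks all other decoder positions; that detail is omitted here for brevity.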
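For the Dataset Splits row, the quoted 10K/20K validation/test sampling can be mirrored with a simple random hold-out. The paper does not specify the sampling procedure or seed, so the function below is an assumed implementation, not the authors' script.

```python
import random

def random_split(pairs, n_valid=10_000, n_test=20_000, seed=0):
    """Randomly hold out validation and test pairs; the rest is training data.

    The quoted description only says "randomly sample 10K/20K pairs as the
    validation/test set", so the shuffle and seed here are assumptions.
    """
    pairs = list(pairs)
    random.Random(seed).shuffle(pairs)
    valid = pairs[:n_valid]
    test = pairs[n_valid:n_valid + n_test]
    train = pairs[n_valid + n_test:]
    return train, valid, test
```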
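The Experiment Setup row pins down the architecture and optimizer hyperparameters. The sketch below instantiates them with torch.nn.Transformer as a stand-in for the XLM-based implementation actually used by the authors; the vocabulary size and the 16-head attention setting are not stated in the quote and are assumptions.

```python
import torch
from torch import nn

# Hyperparameters quoted in the Experiment Setup row.
D_MODEL = 1024          # embedding/hidden size
FFN_DIM = 4096          # feed-forward filter size
N_LAYERS = 6            # encoder and decoder depth
LR = 1e-4               # Adam learning rate for pre-training

VOCAB_SIZE = 60_000     # placeholder; the paper uses a learned BPE vocabulary

model = nn.Transformer(
    d_model=D_MODEL,
    nhead=16,                     # assumed; not stated in the quote
    num_encoder_layers=N_LAYERS,
    num_decoder_layers=N_LAYERS,
    dim_feedforward=FFN_DIM,
)
embedding = nn.Embedding(VOCAB_SIZE, D_MODEL)
output_proj = nn.Linear(D_MODEL, VOCAB_SIZE)

optimizer = torch.optim.Adam(
    list(model.parameters())
    + list(embedding.parameters())
    + list(output_proj.parameters()),
    lr=LR,
)
```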