MASS: Masked Sequence to Sequence Pre-training for Language Generation
Authors: Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, Tie-Yan Liu
ICML 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | By further fine-tuning on a variety of zero/low-resource language generation tasks, including neural machine translation, text summarization and conversational response generation (3 tasks and totally 8 datasets), MASS achieves significant improvements over baselines without pre-training or with other pretraining methods. In this section, we describe the experimental details about MASS pre-training and fine-tuning on a variety of language generation tasks, including NMT, text summarization, conversational response generation. |
| Researcher Affiliation | Collaboration | 1Key Laboratory of Intelligent Perception and Systems for High-Dimensional Information of Ministry of Education, Nanjing University of Science and Technology 2Microsoft Research. |
| Pseudocode | No | The paper describes the methodology but does not include any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | We release the codes in https://github.com/microsoft/MASS. |
| Open Datasets | Yes | We use all of the monolingual data from WMT News Crawl datasets, which covers 190M, 62M and 270M sentences from year 2007 to 2017 for English, French, German respectively. [...] The monolingual data for each language is downloaded from http://www.statmt.org/wmt16/translation-task.html. For the other two tasks, we conduct experiments on: 1) the Gigaword corpus for abstractive text summarization; 2) the Cornell Movie Dialog corpus for conversational response generation. |
| Dataset Splits | Yes | During evaluation, we calculate the BLEU score with multi-bleu.pl on newstest2014 for English-French, and newstest2016 for English-German and English-Romanian. We randomly sample 10K/20K pairs as the validation/test set and the remaining data is used for training. (A sketch of this split appears below the table.) |
| Hardware Specification | Yes | The model is trained on 8 NVIDIA V100 GPU cards and each mini-batch contains 3000 tokens for pre-training. |
| Software Dependencies | No | The paper mentions implementing the method based on the XLM codebase and using `multi-bleu.pl` for evaluation, but it does not provide specific version numbers for software dependencies such as Python, PyTorch, or TensorFlow libraries. |
| Experiment Setup | Yes | We choose Transformer (Vaswani et al., 2017) as the basic model structure, which consists of a 6-layer encoder and a 6-layer decoder with 1024 embedding/hidden size and 4096 feed-forward filter size. ... We use the Adam optimizer (Kingma & Ba, 2015) with a learning rate of 10^-4 for the pre-training. The model is trained on 8 NVIDIA V100 GPU cards and each mini-batch contains 3000 tokens for pre-training. (A configuration sketch appears below the table.) |
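
The random 10K/20K validation/test split quoted in the Dataset Splits row is simple to reproduce. The helper below is a minimal sketch, not the authors' preprocessing script; the function name `split_pairs`, the `pairs` list of (source, target) tuples, and the fixed seed are placeholders introduced here for illustration.

```python
# Hedged sketch of the split described in the "Dataset Splits" row:
# randomly hold out 10K pairs for validation and 20K pairs for testing,
# and keep the remainder for training.
# `pairs` and `seed` are placeholders, not taken from the paper or the
# released code.
import random

def split_pairs(pairs, n_valid=10_000, n_test=20_000, seed=0):
    rng = random.Random(seed)
    shuffled = list(pairs)
    rng.shuffle(shuffled)
    valid = shuffled[:n_valid]
    test = shuffled[n_valid:n_valid + n_test]
    train = shuffled[n_valid + n_test:]
    return train, valid, test

# Usage: train, valid, test = split_pairs(all_pairs)
```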
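
The Experiment Setup row fixes the model size and optimizer but not the surrounding training code. The snippet below is a hedged PyTorch sketch of a Transformer with those dimensions (6-layer encoder and decoder, 1024 embedding/hidden size, 4096 feed-forward size, Adam at 1e-4). It is not the authors' released XLM-based implementation at https://github.com/microsoft/MASS; the attention-head count, dropout, and vocabulary size are assumptions, and positional encodings, masking, and the MASS pre-training objective itself are omitted.

```python
# Minimal sketch of the model/optimizer configuration quoted in the
# "Experiment Setup" row. Values not stated in that row (nhead, dropout,
# vocab size) are assumptions; positional encodings and the MASS masked
# seq2seq objective are intentionally left out for brevity.
import torch
import torch.nn as nn

VOCAB_SIZE = 60_000  # assumed placeholder; the paper uses a BPE vocabulary

class MassSizedTransformer(nn.Module):
    def __init__(self, vocab_size=VOCAB_SIZE, d_model=1024, nhead=16,
                 num_layers=6, ffn_dim=4096, dropout=0.1):
        super().__init__()
        # Shared token embedding for encoder and decoder inputs.
        self.embed = nn.Embedding(vocab_size, d_model)
        # 6-layer encoder / 6-layer decoder, 1024 hidden, 4096 feed-forward.
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=nhead,
            num_encoder_layers=num_layers, num_decoder_layers=num_layers,
            dim_feedforward=ffn_dim, dropout=dropout, batch_first=True)
        # Project decoder states back to vocabulary logits.
        self.proj = nn.Linear(d_model, vocab_size)

    def forward(self, src_tokens, tgt_tokens):
        src = self.embed(src_tokens)
        tgt = self.embed(tgt_tokens)
        out = self.transformer(src, tgt)
        return self.proj(out)

model = MassSizedTransformer()
# Adam with a 1e-4 learning rate, as quoted for pre-training.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
```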