MASS: Masked Sequence to Sequence Pre-training for Language Generation

Authors: Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, Tie-Yan Liu

ICML 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | By further fine-tuning on a variety of zero/low-resource language generation tasks, including neural machine translation, text summarization and conversational response generation (3 tasks and 8 datasets in total), MASS achieves significant improvements over baselines without pre-training or with other pre-training methods. [...] In this section, we describe the experimental details of MASS pre-training and fine-tuning on a variety of language generation tasks, including NMT, text summarization and conversational response generation.
Researcher Affiliation | Collaboration | 1) Key Laboratory of Intelligent Perception and Systems for High-Dimensional Information of Ministry of Education, Nanjing University of Science and Technology; 2) Microsoft Research.
Pseudocode | No | The paper describes the methodology but does not include any structured pseudocode or algorithm blocks (a sketch of the masked sequence-to-sequence objective follows the table).
Open Source Code | Yes | We release the code at https://github.com/microsoft/MASS.
Open Datasets | Yes | We use all of the monolingual data from the WMT News Crawl datasets, which cover 190M, 62M and 270M sentences from 2007 to 2017 for English, French and German respectively. [...] The monolingual data for each language is downloaded from http://www.statmt.org/wmt16/translation-task.html. For the other two tasks, we conduct experiments on: 1) the Gigaword corpus for abstractive text summarization; 2) the Cornell Movie Dialog corpus for conversational response generation.
Dataset Splits | Yes | During evaluation, we calculate the BLEU score with multi-bleu.pl on newstest2014 for English-French, and on newstest2016 for English-German and English-Romanian. We randomly sample 10K/20K pairs as the validation/test set and the remaining data is used for training (an assumed split implementation follows the table).
Hardware Specification | Yes | The model is trained on 8 NVIDIA V100 GPU cards and each mini-batch contains 3000 tokens for pre-training.
Software Dependencies | No | The paper mentions implementing the method on the XLM codebase and using `multi-bleu.pl` for evaluation, but it does not provide version numbers for software dependencies such as Python or PyTorch.
Experiment Setup | Yes | We choose Transformer (Vaswani et al., 2017) as the basic model structure, which consists of a 6-layer encoder and 6-layer decoder with 1024 embedding/hidden size and 4096 feed-forward filter size. ... We use the Adam optimizer (Kingma & Ba, 2015) with a learning rate of 10^-4 for pre-training. The model is trained on 8 NVIDIA V100 GPU cards and each mini-batch contains 3000 tokens for pre-training. (A configuration sketch follows the table.)
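
As noted in the Pseudocode row, the paper contains no algorithm block. The following is a minimal Python sketch of the masked sequence-to-sequence objective described in the paper: a contiguous fragment of the source sentence is replaced by [MASK] tokens on the encoder side, and the decoder is trained to reconstruct that fragment. The helper name `mass_example`, the default mask ratio of 0.5, and the simplified right-shifted decoder input are assumptions for illustration only; the released code at https://github.com/microsoft/MASS implements the full data pipeline.

```python
import random

MASK = "[MASK]"

def mass_example(tokens, mask_ratio=0.5, seed=None):
    """Build one simplified MASS training example from a token list.

    A contiguous fragment covering roughly `mask_ratio` of the sentence is
    masked on the encoder side; the decoder reconstructs that fragment,
    seeing only the fragment's own right-shifted tokens as input.
    """
    rng = random.Random(seed)
    m = len(tokens)
    span_len = max(1, round(m * mask_ratio))
    u = rng.randint(0, m - span_len)            # fragment start (inclusive)
    v = u + span_len                            # fragment end (exclusive)

    fragment = tokens[u:v]
    encoder_input = tokens[:u] + [MASK] * span_len + tokens[v:]
    decoder_input = [MASK] + fragment[:-1]      # right-shifted fragment
    decoder_target = fragment
    return encoder_input, decoder_input, decoder_target

enc_in, dec_in, dec_out = mass_example(
    "the quick brown fox jumps over the lazy dog".split(), seed=0)
print(enc_in)   # source with a contiguous [MASK]-ed fragment
print(dec_in)   # fragment shifted right, first position masked
print(dec_out)  # the fragment the decoder must predict
```

In the authors' implementation the decoder additionally keeps the original positional indices of the fragment and masks all other decoder positions; that detail is omitted here for brevity.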
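For the Dataset Splits row, the quoted 10K/20K validation/test sampling can be mirrored with a simple random hold-out. The paper does not specify the sampling procedure or seed, so the function below is an assumed implementation, not the authors' script.

```python
import random

def random_split(pairs, n_valid=10_000, n_test=20_000, seed=0):
    """Randomly hold out validation and test pairs; the rest is training data.

    The quoted description only says "randomly sample 10K/20K pairs as the
    validation/test set", so the shuffle and seed here are assumptions.
    """
    pairs = list(pairs)
    random.Random(seed).shuffle(pairs)
    valid = pairs[:n_valid]
    test = pairs[n_valid:n_valid + n_test]
    train = pairs[n_valid + n_test:]
    return train, valid, test
```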
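The Experiment Setup row pins down the architecture and optimizer hyperparameters. The sketch below instantiates them with torch.nn.Transformer as a stand-in for the XLM-based implementation actually used by the authors; the vocabulary size and the 16-head attention setting are not stated in the quote and are assumptions.

```python
import torch
from torch import nn

# Hyperparameters quoted in the Experiment Setup row.
D_MODEL = 1024          # embedding/hidden size
FFN_DIM = 4096          # feed-forward filter size
N_LAYERS = 6            # encoder and decoder depth
LR = 1e-4               # Adam learning rate for pre-training

VOCAB_SIZE = 60_000     # placeholder; the paper uses a learned BPE vocabulary

model = nn.Transformer(
    d_model=D_MODEL,
    nhead=16,                     # assumed; not stated in the quote
    num_encoder_layers=N_LAYERS,
    num_decoder_layers=N_LAYERS,
    dim_feedforward=FFN_DIM,
)
embedding = nn.Embedding(VOCAB_SIZE, D_MODEL)
output_proj = nn.Linear(D_MODEL, VOCAB_SIZE)

optimizer = torch.optim.Adam(
    list(model.parameters())
    + list(embedding.parameters())
    + list(output_proj.parameters()),
    lr=LR,
)
```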