BANG: Bridging Autoregressive and Non-autoregressive Generation with Large Scale Pretraining
Authors: Weizhen Qi, Yeyun Gong, Jian Jiao, Yu Yan, Weizhu Chen, Dayiheng Liu, Kewen Tang, Houqiang Li, Jiusheng Chen, Ruofei Zhang, Ming Zhou, Nan Duan
ICML 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on question generation (SQuAD 1.1), summarization (XSum) and dialogue generation (PersonaChat) show that BANG improves NAR and semi-NAR performance significantly as well as attaining comparable performance with strong AR pretrained models. |
| Researcher Affiliation | Collaboration | 1University of Science and Technology of China, Hefei, China 2During Internship at MSRA 3Microsoft Research Asia, Beijing, China 4Microsoft, Redmond, USA 5Sichuan University, Chengdu, China. |
| Pseudocode | Yes | Algorithm 1 Cross-stream Visible N-stream Self-attention |
| Open Source Code | Yes | In this paper, we propose a new model named BANG 1 to bridge the gap between AR and NAR via pretraining a generative model. 1https://github.com/microsoft/BANG |
| Open Datasets | Yes | XSum (Narayan et al., 2018) contains 227K online article and single sentence summary pairs from the British Broadcasting Corporation (BBC). SQuAD 1.1 (Rajpurkar et al., 2016) is a dataset created for machine reading comprehension. PersonaChat (Zhang et al., 2018) is a dataset created for multi-turn conversation with personalizing profiles. |
| Dataset Splits | Yes | We finetune BANG on each dataset for 10 epochs and select the best model according to the performance on their dev sets. We select the best checkpoint based on the performance on the dev set. |
| Hardware Specification | Yes | For all downstream tasks, we use 8 NVIDIA Tesla V100 GPUs for finetuning and one single V100 GPU for inference. |
| Software Dependencies | Yes | All the experiments are conducted on the Fairseq (Ott et al., 2019) v0.9.0 codebase and we use the built-in time statistics function to calculate the per-sample inference latency. (A latency-measurement sketch follows the table.) |
| Experiment Setup | Yes | We pretrain BANG from scratch with a learning rate of 3e-4 for 35 epochs and a batch size of 2048. BANG AR finetuning hyper-parameters are: learning rate 1e-4, warm-up steps of 1000, Adam optimizer, the maximum input and output length of 512, and label smoothing of 0.1. We finetune BANG on each dataset for 10 epochs and select the best model according to the performance on their dev sets. For inference, we set the beam size as 4, length penalty as 1.0 and batch size as 1 to calculate the latency. (A hedged command-line sketch of this setup follows the table.) |
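
The Experiment Setup row lists the finetuning hyper-parameters but not the exact command. Below is a minimal, hypothetical sketch of how those values could map onto Fairseq v0.9.0 flags; the data-bin path, `--user-dir`, `--arch`/`--task` names, and the `--max-tokens` budget are placeholders assumed for illustration, not values taken from the paper or the released code.

```python
# Hypothetical finetuning invocation reflecting the hyper-parameters quoted above.
# Dataset path, save dir, --user-dir, --arch, --task and --max-tokens are assumptions.
import subprocess

DATA_BIN = "data-bin/xsum"          # assumed: binarized downstream dataset
SAVE_DIR = "checkpoints/bang_xsum"  # assumed: checkpoint output directory

finetune_cmd = [
    "fairseq-train", DATA_BIN,
    "--user-dir", "bang",                         # assumed: BANG model extension dir
    "--arch", "bang_base",                        # placeholder architecture name
    "--task", "translation",                      # placeholder seq2seq task
    "--optimizer", "adam",                        # Adam optimizer
    "--lr", "1e-4",                               # finetuning learning rate 1e-4
    "--lr-scheduler", "inverse_sqrt",
    "--warmup-updates", "1000",                   # 1000 warm-up steps
    "--criterion", "label_smoothed_cross_entropy",
    "--label-smoothing", "0.1",                   # label smoothing 0.1
    "--max-source-positions", "512",              # maximum input length 512
    "--max-target-positions", "512",              # maximum output length 512
    "--max-epoch", "10",                          # 10 finetuning epochs per dataset
    "--max-tokens", "4096",                       # assumed per-GPU token budget (not stated)
    "--save-dir", SAVE_DIR,
]
subprocess.run(finetune_cmd, check=True)
```

Checkpoint selection would then follow the quoted procedure: evaluate each saved epoch on the dev set and keep the best one.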
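
The Software Dependencies row states that per-sample latency comes from Fairseq's built-in time statistics, and the Experiment Setup row fixes beam size 4, length penalty 1.0 and batch size 1 for inference. A minimal sketch of such a latency run, again with placeholder paths, is shown below; `fairseq-generate` prints a decoding summary with sentences/s and tokens/s, which is the kind of built-in timing the quote refers to.

```python
# Hypothetical inference/latency run with the decoding settings quoted above.
# Data-bin and checkpoint paths, plus --user-dir, are assumptions.
import subprocess

generate_cmd = [
    "fairseq-generate", "data-bin/xsum",                   # assumed binarized test split
    "--path", "checkpoints/bang_xsum/checkpoint_best.pt",  # assumed best checkpoint
    "--user-dir", "bang",                                  # assumed BANG extension dir
    "--beam", "4",                                         # beam size 4
    "--lenpen", "1.0",                                     # length penalty 1.0
    "--batch-size", "1",                                   # batch size 1 for per-sample latency
    "--max-source-positions", "512",
    "--max-target-positions", "512",
]
subprocess.run(generate_cmd, check=True)
```

Dividing the reported decoding time by the number of test sentences gives the per-sample latency on a single GPU, matching the hardware setup quoted above.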