BANG: Bridging Autoregressive and Non-autoregressive Generation with Large Scale Pretraining

Authors: Weizhen Qi, Yeyun Gong, Jian Jiao, Yu Yan, Weizhu Chen, Dayiheng Liu, Kewen Tang, Houqiang Li, Jiusheng Chen, Ruofei Zhang, Ming Zhou, Nan Duan

ICML 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Experiments on question generation (SQuAD 1.1), summarization (XSum) and dialogue generation (PersonaChat) show that BANG improves NAR and semi-NAR performance significantly as well as attaining comparable performance with strong AR pretrained models."
Researcher Affiliation | Collaboration | "1 University of Science and Technology of China, Hefei, China; 2 During Internship at MSRA; 3 Microsoft Research Asia, Beijing, China; 4 Microsoft, Redmond, USA; 5 Sichuan University, Chengdu, China."
Pseudocode | Yes | "Algorithm 1: Cross-stream Visible N-stream Self-attention" (an illustrative masking sketch follows the table)
Open Source Code | Yes | "In this paper, we propose a new model named BANG to bridge the gap between AR and NAR via pretraining a generative model." Code: https://github.com/microsoft/BANG
Open Datasets | Yes | "XSum (Narayan et al., 2018) contains 227K online article and single sentence summary pairs from the British Broadcasting Corporation (BBC). SQuAD 1.1 (Rajpurkar et al., 2016) is a dataset created for machine reading comprehension. PersonaChat (Zhang et al., 2018) is a dataset created for multi-turn conversation with personalizing profiles."
Dataset Splits | Yes | "We finetune BANG on each dataset for 10 epochs and select the best model according to the performance on their dev sets." "We select the best checkpoint based on the performance on dev set."
Hardware Specification | Yes | "For all downstream tasks, we use 8 NVIDIA Tesla V100 GPUs for finetuning and one single V100 GPU for inference."
Software Dependencies | Yes | "All the experiments are conducted on the Fairseq (Ott et al., 2019) v0.9.0 codebase and we use the built-in time statistics function to calculate the per-sample inference latency." (a minimal timing sketch follows the table)
Experiment Setup | Yes | "We pretrain BANG from scratch with a learning rate of 3e-4 for 35 epochs and a batch size of 2048. BANG AR finetuning hyper-parameters are: learning rate 1e-4, warm-up steps of 1000, Adam optimizer, the maximum input and output length of 512, and a label smoothing of 0.1. We finetune BANG on each dataset for 10 epochs and select the best model according to the performance on their dev sets. For inference, we set the beam size as 4, length penalty as 1.0 and batch size as 1 to calculate the latency." (an illustrative hyper-parameter sketch follows the table)
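The Pseudocode row only names Algorithm 1 (Cross-stream Visible N-stream Self-attention). As a rough, non-authoritative illustration of the kind of masking such an n-stream attention needs, the PyTorch sketch below builds one boolean visibility mask per predicting stream; the visibility rule (the k-th stream at target position t attends only to main-stream positions j <= t - k) and the helper name cross_stream_visibility_masks are assumptions for illustration, not the paper's Algorithm 1.

```python
# Illustrative only: per-stream visibility masks for an n-stream decoder.
# Assumed rule: the k-th predicting stream at target position t may attend
# to main-stream positions j <= t - k (stream 1 is AR-like; higher streams
# hide additional tokens immediately before the target).
import torch

def cross_stream_visibility_masks(seq_len: int, num_streams: int) -> torch.Tensor:
    """Return a [num_streams, seq_len, seq_len] bool tensor where
    mask[k-1, t, j] is True iff stream k predicting position t sees position j."""
    pos = torch.arange(seq_len)
    t = pos.view(1, -1, 1)                                # target positions
    j = pos.view(1, 1, -1)                                # key positions
    k = torch.arange(1, num_streams + 1).view(-1, 1, 1)   # stream index, 1-based
    return j <= t - k

masks = cross_stream_visibility_masks(seq_len=6, num_streams=3)
print(masks[0].int())  # stream 1: sees everything strictly before the target
print(masks[2].int())  # stream 3: two more positions hidden before the target
```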
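The latency figures in the Software Dependencies row come from Fairseq's built-in time statistics. The stand-alone sketch below shows the same idea (wall-clock seconds per sample at batch size 1) with a placeholder generate callable; it is not the Fairseq API.

```python
# Minimal per-sample latency measurement (assumed stand-in for Fairseq's
# built-in time statistics): time generation one sample at a time.
import time
from typing import Any, Callable, Iterable

def average_latency(generate: Callable[[Any], Any], samples: Iterable[Any]) -> float:
    """Mean wall-clock seconds per sample with batch size 1."""
    total, count = 0.0, 0
    for sample in samples:
        start = time.perf_counter()
        generate(sample)              # beam search (AR) or parallel (NAR) decoding
        total += time.perf_counter() - start
        count += 1
    return total / max(count, 1)

# Usage with a dummy generator standing in for the finetuned model:
print(f"{average_latency(lambda s: s[::-1], ['a', 'few', 'inputs']):.6f} s/sample")
```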
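To make the quoted Experiment Setup settings concrete, the sketch below wires the stated AR finetuning hyper-parameters (Adam, learning rate 1e-4, 1000 warm-up steps, label smoothing 0.1) into a toy PyTorch training step. The tiny model, the 32k vocabulary, and the inverse-sqrt decay after warm-up are assumptions; the actual experiments run on Fairseq v0.9.0, not this script.

```python
# Illustrative training step using the quoted AR finetuning hyper-parameters.
# The tiny linear "model" and the 32k vocabulary are placeholders.
import torch
import torch.nn as nn

model = nn.Linear(512, 32000)                       # stand-in decoder output head
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

warmup_steps = 1000                                 # quoted warm-up steps
scheduler = torch.optim.lr_scheduler.LambdaLR(      # linear warm-up, then
    optimizer,                                      # inverse-sqrt decay (assumed)
    lr_lambda=lambda step: min((step + 1) / warmup_steps,
                               (warmup_steps / (step + 1)) ** 0.5),
)

criterion = nn.CrossEntropyLoss(label_smoothing=0.1)  # quoted label smoothing

logits = model(torch.randn(8, 512))                 # [batch, vocab]
targets = torch.randint(0, 32000, (8,))
loss = criterion(logits, targets)
loss.backward()
optimizer.step()
scheduler.step()
```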