BANG: Bridging Autoregressive and Non-autoregressive Generation with Large Scale Pretraining
Authors: Weizhen Qi, Yeyun Gong, Jian Jiao, Yu Yan, Weizhu Chen, Dayiheng Liu, Kewen Tang, Houqiang Li, Jiusheng Chen, Ruofei Zhang, Ming Zhou, Nan Duan
ICML 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on question generation (SQuAD 1.1), summarization (XSum) and dialogue generation (PersonaChat) show that BANG improves NAR and semi-NAR performance significantly as well as attaining comparable performance with strong AR pretrained models. |
| Researcher Affiliation | Collaboration | 1University of Science and Technology of China, Hefei, China 2During Internship at MSRA 3Microsoft Research Asia, Beijing, China 4Microsoft, Redmond, USA 5Sichuan University, Chengdu, China. |
| Pseudocode | Yes | Algorithm 1 Cross-stream Visible N-stream Self-attention |
| Open Source Code | Yes | In this paper, we propose a new model named BANG 1 to bridge the gap between AR and NAR via pretraining a generative model. 1https://github.com/microsoft/BANG |
| Open Datasets | Yes | XSum (Narayan et al., 2018) contains 227K online article and single sentence summary pairs from the British Broadcasting Corporation (BBC). SQuAD 1.1 (Rajpurkar et al., 2016) is a dataset created for machine reading comprehension. PersonaChat (Zhang et al., 2018) is a dataset created for multi-turn conversation with personalizing profiles. |
| Dataset Splits | Yes | We finetune BANG on each dataset for 10 epochs and select the best model according to the performance on their dev sets. We select the best checkpoint based on the performance on the dev set. |
| Hardware Specification | Yes | For all downstream tasks, we use 8 NVIDIA Tesla V100 GPUs for finetuning and one single V100 GPU for inference. |
| Software Dependencies | Yes | All the experiments are conducted on the Fairseq (Ott et al., 2019) v0.9.0 codebase and we use the built-in time statistics function to calculate the per-sample inference latency. (A latency-measurement sketch follows the table.) |
| Experiment Setup | Yes | We pretrain BANG from scratch with a learning rate of 3e-4 for 35 epochs and a batch size of 2048. BANG AR finetuning hyper-parameters are: learning rate 1e-4, warm-up steps of 1000, Adam optimizer, the maximum input and output length of 512, and label smoothing of 0.1. We finetune BANG on each dataset for 10 epochs and select the best model according to the performance on their dev sets. For inference, we set the beam size as 4, length penalty as 1.0 and batch size as 1 to calculate the latency. (A hedged command-line sketch of this setup follows the table.) |
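
The Experiment Setup row lists the finetuning hyper-parameters but not the exact command. Below is a minimal, hypothetical sketch of how those values could map onto Fairseq v0.9.0 flags; the data-bin path, `--user-dir`, `--arch`/`--task` names, and the `--max-tokens` budget are placeholders assumed for illustration, not values taken from the paper or the released code.

```python
# Hypothetical finetuning invocation reflecting the hyper-parameters quoted above.
# Dataset path, save dir, --user-dir, --arch, --task and --max-tokens are assumptions.
import subprocess

DATA_BIN = "data-bin/xsum"          # assumed: binarized downstream dataset
SAVE_DIR = "checkpoints/bang_xsum"  # assumed: checkpoint output directory

finetune_cmd = [
    "fairseq-train", DATA_BIN,
    "--user-dir", "bang",                         # assumed: BANG model extension dir
    "--arch", "bang_base",                        # placeholder architecture name
    "--task", "translation",                      # placeholder seq2seq task
    "--optimizer", "adam",                        # Adam optimizer
    "--lr", "1e-4",                               # finetuning learning rate 1e-4
    "--lr-scheduler", "inverse_sqrt",
    "--warmup-updates", "1000",                   # 1000 warm-up steps
    "--criterion", "label_smoothed_cross_entropy",
    "--label-smoothing", "0.1",                   # label smoothing 0.1
    "--max-source-positions", "512",              # maximum input length 512
    "--max-target-positions", "512",              # maximum output length 512
    "--max-epoch", "10",                          # 10 finetuning epochs per dataset
    "--max-tokens", "4096",                       # assumed per-GPU token budget (not stated)
    "--save-dir", SAVE_DIR,
]
subprocess.run(finetune_cmd, check=True)
```

Checkpoint selection would then follow the quoted procedure: evaluate each saved epoch on the dev set and keep the best one.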
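
The Software Dependencies row states that per-sample latency comes from Fairseq's built-in time statistics, and the Experiment Setup row fixes beam size 4, length penalty 1.0 and batch size 1 for inference. A minimal sketch of such a latency run, again with placeholder paths, is shown below; `fairseq-generate` prints a decoding summary with sentences/s and tokens/s, which is the kind of built-in timing the quote refers to.

```python
# Hypothetical inference/latency run with the decoding settings quoted above.
# Data-bin and checkpoint paths, plus --user-dir, are assumptions.
import subprocess

generate_cmd = [
    "fairseq-generate", "data-bin/xsum",                   # assumed binarized test split
    "--path", "checkpoints/bang_xsum/checkpoint_best.pt",  # assumed best checkpoint
    "--user-dir", "bang",                                  # assumed BANG extension dir
    "--beam", "4",                                         # beam size 4
    "--lenpen", "1.0",                                     # length penalty 1.0
    "--batch-size", "1",                                   # batch size 1 for per-sample latency
    "--max-source-positions", "512",
    "--max-target-positions", "512",
]
subprocess.run(generate_cmd, check=True)
```

Dividing the reported decoding time by the number of test sentences gives the per-sample latency on a single GPU, matching the hardware setup quoted above.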