AR-Diffusion: Auto-Regressive Diffusion Model for Text Generation
Authors: Tong Wu, Zhihao Fan, Xiao Liu, Hai-Tao Zheng, Yeyun Gong, Yelong Shen, Jian Jiao, Juntao Li, Zhongyu Wei, Jian Guo, Nan Duan, Weizhu Chen
NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In a series of experiments on various text generation tasks, including text summarization, machine translation, and common sense generation, AR-DIFFUSION clearly demonstrated its superiority over existing diffusion language models and showed that it can be 100× to 600× faster while achieving comparable results. Experimental results across various text generation tasks, such as text summarization, machine translation, and common sense generation, have consistently demonstrated that AR-DIFFUSION surpasses existing text diffusion models, as well as AR methods, in terms of both quality and diversity. |
| Researcher Affiliation | Collaboration | Tong Wu1, Zhihao Fan2, Xiao Liu3, Hai-Tao Zheng1,8, Yeyun Gong3, Yelong Shen4, Jian Jiao5, Juntao Li6, Zhongyu Wei2, Jian Guo7, Nan Duan3, Weizhu Chen4; 1Shenzhen International Graduate School, Tsinghua University, 2Fudan University, 3Microsoft Research Asia, 4Microsoft Azure AI, 5Microsoft, 6Soochow University, 7IDEA Research, 8Pengcheng Laboratory |
| Pseudocode | Yes | Algorithm 1: Training Process of AR-DIFFUSION. Algorithm 2: Inference Process of AR-DIFFUSION with the Skipping Mechanism. (A generic timestep-skipping sketch follows the table.) |
| Open Source Code | Yes | Our code is available at this https URL. |
| Open Datasets | Yes | In our experiments, we use the publicly available XSUM [Narayan et al., 2018] and CNN/DAILYMAIL [Hermann et al., 2015] datasets from GLGE, which is also named GLGE-Easy. We choose the IWSLT14 dataset, and the data processing follows the scripts provided by fairseq. We use the COMMONGEN dataset for evaluation. |
| Dataset Splits | No | The paper mentions a "COMMONGEN dev set" in Table 4, implying a development or validation set for that specific task. However, it does not explicitly state the dataset splits (e.g., percentages, sample counts) for training, validation, and testing across all datasets or a general methodology for creating such splits. |
| Hardware Specification | Yes | All experiments are implemented on 8 Tesla V100-32G. |
| Software Dependencies | No | Our model configuration is implemented based on Transformer-base [Vaswani et al., 2017]. For other tasks, we adopt the tokenizer and vocabulary of bert-base-uncased. In addition, we use the AdamW (weight decay = 0.0) optimizer and a dropout of 0.2. The paper mentions various software components and models but does not provide specific version numbers for them (e.g., PyTorch, the Transformers library, or the BERT model). |
| Experiment Setup | Yes | Model Setup: Our model configuration is implemented based on Transformer-base [Vaswani et al., 2017]. In particular, for XSUM and CNN/DAILYMAIL, we set the diffusion embedding dimension to 128. For IWSLT14, we use a 64-dimensional diffusion embedding, 4 attention heads and 1024-dimensional feed-forward layers. For COMMONGEN, we adopt a 64-dimensional diffusion embedding, 8 attention heads and 512-dimensional feed-forward layers. In the training phase, we employ a square-root noise schedule and 2,000 diffusion steps [Li et al., 2022a]. Our training parameters on different datasets are shown in Table 7 (columns: Dataset, Lr & Schedule, Batch Size, Optimized Steps, Target Length), where Batch Size = mini batch size × Ngc × GPU number and Optimized Steps = total steps / Ngc, with Ngc the gradient accumulation number. The learning rate follows a linear schedule whose warm-up steps are 4,000 / Ngc. In addition, we use the AdamW (weight decay = 0.0) optimizer and a dropout of 0.2. (A hedged configuration sketch follows the table.) |
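
The Experiment Setup row quotes several concrete hyperparameters. As a reading aid, here is a minimal, hypothetical PyTorch-style sketch of that configuration; the exact square-root schedule formula and its offset `s`, the placeholder model, and the mini-batch numbers are assumptions (the paper only names the schedule via Li et al. [2022a]), while the 2,000 diffusion steps, AdamW with weight decay 0.0, dropout 0.2, and the batch-size arithmetic come directly from the quote.

```python
# Hedged sketch of the quoted training configuration; illustrative only,
# not the released AR-DIFFUSION code.
import math
import torch

T = 2000    # diffusion steps used in training (from the paper)
s = 1e-4    # small offset for the sqrt schedule -- an assumption following Li et al. [2022a]

def alpha_bar(t: int) -> float:
    """Square-root noise schedule: cumulative signal level at step t."""
    return 1.0 - math.sqrt(t / T + s)

# Batch-size bookkeeping quoted above:
#   Batch Size = mini batch size * Ngc * GPU number
#   Optimized Steps = total steps / Ngc
mini_batch_size, n_gc, n_gpus = 16, 4, 8                 # illustrative numbers only
effective_batch_size = mini_batch_size * n_gc * n_gpus   # 512 with these numbers

# Optimizer and regularization as stated in the paper.
model = torch.nn.Linear(128, 128)   # placeholder for the Transformer-base network
optimizer = torch.optim.AdamW(model.parameters(), weight_decay=0.0)
dropout = torch.nn.Dropout(p=0.2)
```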
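
The Pseudocode row points to Algorithm 2, inference with the Skipping Mechanism, which underlies the 100× to 600× speed-up quoted above. The paper's algorithm operates on its own multi-level timesteps and is not reproduced here; the sketch below only illustrates the generic idea of running the reverse process on a subsampled set of timesteps, with hypothetical function and parameter names.

```python
# Generic timestep subsampling, NOT the paper's Algorithm 2; names are hypothetical.
import numpy as np

def skipped_timesteps(T: int = 2000, num_inference_steps: int = 20) -> list[int]:
    """Pick evenly spaced timesteps from T-1 down to 0 so the reverse
    diffusion loop runs num_inference_steps iterations instead of T."""
    ts = np.linspace(T - 1, 0, num_inference_steps)
    return [int(round(t)) for t in ts]

# Example: 20 steps instead of 2,000 -> roughly a 100x reduction in denoising calls.
print(skipped_timesteps())   # [1999, 1894, 1789, ..., 105, 0]
```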