DiffuSeq: Sequence to Sequence Text Generation with Diffusion Models
Authors: Shansan Gong, Mukai Li, Jiangtao Feng, Zhiyong Wu, Lingpeng Kong
ICLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Upon extensive evaluation over a wide range of SEQ2SEQ tasks, we find DIFFUSEQ achieving comparable or even better performance than six established baselines, including a state-of-the-art model that is based on pre-trained language models. Apart from quality, an intriguing property of DIFFUSEQ is its high diversity during generation, which is desired in many SEQ2SEQ tasks. We further include a theoretical analysis revealing the connection between DIFFUSEQ and autoregressive/non-autoregressive models. |
| Researcher Affiliation | Collaboration | Shansan Gong1, Mukai Li1, Jiangtao Feng1, Zhiyong Wu1, Lingpeng Kong2 1Shark-NLP, Shanghai AI Laboratory 2The University of Hong Kong |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code is available at https://github.com/Shark-NLP/DiffuSeq |
| Open Datasets | Yes | Open domain dialogue requires models to generate informative responses given a dialogue context. We use the Commonsense Conversation Dataset (Zhou et al., 2018), which is extracted from Reddit single-round dialogs, with over 3 million conversational pairs. Question generation (QG) aims to generate questions given a context as input. To obtain sufficient training samples, we use the Quasar-T dataset (Dhingra et al., 2017) preprocessed by Lin et al. (2018), and then generate document-question pairs to obtain 119K training samples (details in Appendix D.1). Text simplification aims to revise complex text into sequences with simplified grammar and word choice. Jiang et al. (2020) construct a corpus consisting of 677K complex-simple sentences with revision alignment. The paraphrase task generates an alternative surface form in the same language expressing the same semantic content. We adopt the widely used QQP dataset, sourced from the community question-answering forum Quora, with 147K positive pairs. |
| Dataset Splits | No | The paper mentions using a "dev set" for tuning in Appendix D.2, but does not provide specific split percentages, sample counts, or a detailed methodology for creating the validation split. |
| Hardware Specification | Yes | The experiment is deployed on NVIDIA A100 Tensor Core GPUs, and we use 4 GPUs on training and single GPU on sampling. |
| Software Dependencies | Yes | The implementation is based on NLTK and torchmetrics. The n-gram based metrics may fail to capture the semantic meaning of sentences, so we consider using BERTScore. Specifically, we use microsoft/deberta-xlarge-mnli to help BERTScore correlate better with human scores. (A minimal BERTScore sketch follows the table.) |
| Experiment Setup | Yes | Our DIFFUSEQ is based on 12 layers of Transformer with 12 attention heads, where the time step embedding is plugged in akin to the position embedding. The maximum sequence length is 128, with embedding dimension d = 128, diffusion steps T = 2,000, and a square-root noise schedule. To reduce out-of-vocabulary generation, we apply Byte Pair Encoding (Sennrich et al., 2016) to construct the vocabulary. After conducting diverse beam search (DBS) (Vijayakumar et al., 2016) for the Transformer-base model and the GPT model, we find that DBS does not always promote diversity over temperature sampling, and therefore we list the best diversity results. We compute the accuracy metrics of DIFFUSEQ using MBR with a candidate set of size |S| = 10. (An illustrative MBR selection sketch follows the table.) |
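
The software-dependency row reports that evaluation relies on NLTK, torchmetrics, and BERTScore with the microsoft/deberta-xlarge-mnli backbone. The sketch below shows one way to reproduce that BERTScore computation with the public torchmetrics API; the argument name `model_name_or_path` follows the torchmetrics documentation, and the example sentences are placeholders rather than data from the paper's benchmarks.

```python
# Hedged sketch: BERTScore with the deberta-xlarge-mnli backbone mentioned in
# the paper, computed via torchmetrics. Example strings are placeholders.
from torchmetrics.text.bert import BERTScore

preds = ["what is the name of the largest moon of saturn ?"]
targets = ["which moon of saturn is the largest ?"]

bertscore = BERTScore(model_name_or_path="microsoft/deberta-xlarge-mnli")
scores = bertscore(preds, targets)  # dict with precision, recall, and f1
print(scores["f1"])
```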
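
The experiment-setup row states that accuracy metrics are computed with MBR decoding over |S| = 10 candidate samples. The sketch below illustrates a common way to implement MBR selection, scoring each candidate by its average sentence-level BLEU against the other candidates; the BLEU-based utility and the helper name `mbr_select` are illustrative assumptions, not the paper's exact procedure.

```python
# Minimal MBR selection sketch over |S| = 10 candidates, using NLTK's
# sentence-level BLEU as the utility function (an assumption for illustration).
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def mbr_select(candidates):
    """Return the candidate with the highest average BLEU against the others."""
    smooth = SmoothingFunction().method4
    tokenized = [c.split() for c in candidates]
    best_idx, best_score = 0, float("-inf")
    for i, hyp in enumerate(tokenized):
        # Expected utility of candidate i measured against the rest of S.
        others = [ref for j, ref in enumerate(tokenized) if j != i]
        score = sum(sentence_bleu([ref], hyp, smoothing_function=smooth)
                    for ref in others) / max(len(others), 1)
        if score > best_score:
            best_idx, best_score = i, score
    return candidates[best_idx]

# Usage: sample |S| = 10 sequences from the model for one source, then pick one.
samples = [f"candidate text number {i}" for i in range(10)]  # placeholder strings
print(mbr_select(samples))
```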