Fast Sampling via Discrete Non-Markov Diffusion Models with Predetermined Transition Time

Authors: Zixiang Chen, Huizhuo Yuan, Yongqian Li, Yiwen Kou, Junkai Zhang, Quanquan Gu

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Extensive experiments on natural language generation and machine translation tasks demonstrate the superior performance of our method in terms of both generation speed and sample quality compared to existing methods for discrete diffusion models."
Researcher Affiliation | Academia | "Zixiang Chen, Huizhuo Yuan, Yongqian Li, Yiwen Kou, Junkai Zhang, Quanquan Gu, Department of Computer Science, University of California, Los Angeles, Los Angeles, CA 90095, {chenzx19,hzyuan,yongqianl,evankou,jkzhang,qgu}@cs.ucla.edu"
Pseudocode | Yes | "Algorithm 1 Sampling From DNDM" (an illustrative sketch of this sampling style appears after the table)
Open Source Code | Yes | "Codes are available at https://github.com/uclaml/DNDM."
Open Datasets | Yes | "Datasets. We use the following three datasets to compare with the baselines for machine translation tasks: (1) IWSLT14 DE-EN (Cettolo et al., 2014)... (2) WMT14 EN-DE (Bojar et al., 2014)... and (3) WMT16 EN-RO (Bojar et al., 2016)... The natural language generation task is evaluated on two language datasets following Hoogeboom et al. (2021b): text8 and enwik8."
Dataset Splits | Yes | "The train-validation-test split is fixed across all experiments for all machine translation datasets to ensure fair comparison."
Hardware Specification | Yes | "For the fairness of comparison, all the experiments are conducted using a single NVIDIA RTX A6000 GPU with 48 GB memory."
Software Dependencies | No | The paper mentions Fairseq (Ott et al., 2019) and the GPT-2 and GPT-2-large models, but does not provide specific version numbers for key software components or libraries.
Experiment Setup | Yes | "In all experiments, the batch size is chosen to be 100. For RDM and RDM-k, our hyperparameter settings follow the original paper (Zheng et al., 2023) except for the batch size... We train 12-layer Transformers for both text8 and enwik8 datasets for 500 epochs with the cosine schedule... During training, we employ a learning rate of 0.0001, a weight decay parameter of 0.99, and the Adam optimizer." (a hedged configuration sketch based on these quoted values follows the sampling sketch below)
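
The Pseudocode row above cites "Algorithm 1 Sampling From DNDM". The snippet below is a minimal sketch of the general idea only, not a reproduction of the authors' Algorithm 1: each token position is assigned a transition time before the reverse process starts, and the denoising network is queried only at steps where at least one token transitions, which is what reduces the number of network evaluations. The function name, the uniform transition-time distribution, and the toy denoiser are assumptions made for illustration; the paper derives the transition-time distribution from its noise schedule.

    # Illustrative-only sketch of sampling with predetermined transition times
    # (not the authors' Algorithm 1). Assumed names: sample_with_predetermined_transitions,
    # toy_denoiser. The uniform choice of transition times is an assumption.
    import torch

    def sample_with_predetermined_transitions(denoiser, seq_len, num_steps, mask_id):
        # Start from the fully masked (absorbing) state.
        x = torch.full((seq_len,), mask_id, dtype=torch.long)

        # Predetermine one transition time per position (uniform here for
        # illustration; the paper ties this distribution to the noise schedule).
        tau = torch.randint(1, num_steps + 1, (seq_len,))

        for t in range(num_steps, 0, -1):
            positions = (tau == t).nonzero(as_tuple=True)[0]
            if positions.numel() == 0:
                continue  # no token transitions at this step: skip the network call
            logits = denoiser(x, t)                          # (seq_len, vocab_size)
            probs = torch.softmax(logits[positions], dim=-1)
            x[positions] = torch.multinomial(probs, num_samples=1).squeeze(-1)
        return x

    # Toy usage with a random stand-in for the trained Transformer denoiser.
    def toy_denoiser(x, t, vocab_size=32):
        return torch.randn(x.shape[0], vocab_size)

    sample = sample_with_predetermined_transitions(
        toy_denoiser, seq_len=16, num_steps=50, mask_id=0)
    print(sample)

In this sketch the number of denoiser calls is at most the number of distinct transition times, which is the intuition behind the speedup the paper reports.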
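
The Experiment Setup row quotes concrete training hyperparameters. The hedged configuration sketch below collects only those quoted values (batch size 100, 12-layer Transformer, 500 epochs, cosine schedule, learning rate 1e-4, weight decay 0.99, Adam); the model width, head count, and data pipeline are placeholder assumptions, not values reported in the paper.

    # Hedged configuration sketch based only on the quoted setup description.
    # d_model and nhead below are placeholders, not reported values.
    import torch
    import torch.nn as nn

    config = {
        "batch_size": 100,
        "num_layers": 12,        # 12-layer Transformer for text8 / enwik8
        "epochs": 500,
        "learning_rate": 1e-4,
        "weight_decay": 0.99,    # value as quoted in the setup description
    }

    encoder_layer = nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True)
    model = nn.TransformerEncoder(encoder_layer, num_layers=config["num_layers"])

    optimizer = torch.optim.Adam(
        model.parameters(),
        lr=config["learning_rate"],
        weight_decay=config["weight_decay"],
    )
    # Cosine schedule over the full training run.
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
        optimizer, T_max=config["epochs"])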