Unified Discrete Diffusion for Simultaneous Vision-Language Generation

Authors: Minghui Hu, Chuanxia Zheng, Zuopeng Yang, Tat-Jen Cham, Heliang Zheng, Chaoyue Wang, Dacheng Tao, Ponnuthurai N. Suganthan

ICLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments indicate that our proposed method can perform comparably to the state-of-the-art solutions in various generation tasks.
Researcher Affiliation | Collaboration | Nanyang Technological University; University of Oxford; Qatar University; JD Explore Academy; Shanghai Jiao Tong University. Contact: {e200008, ASTJCham}@ntu.edu.sg, cxzheng@robots.ox.ac.uk, p.n.suganthan@qu.edu.qa, yzpeng@sjtu.edu.cn, zhengheliang@jd.com, chaoyue.wang@outlook.com, dacheng.tao@gmail.com
Pseudocode | No | The paper describes the method and mathematical formulations but does not include any pseudocode or clearly labeled algorithm blocks.
Open Source Code | Yes | The inference code is released at https://github.com/mhh0318/UniD3.
Open Datasets | Yes | We demonstrate the feasibility of the proposed method on two commonly used datasets: CUB200 (Wah et al., 2011) and MSCOCO (Lin et al., 2014).
Dataset Splits | No | The paper specifies training and test image counts for the CUB-200 and MSCOCO datasets but does not explicitly mention a separate validation split: 'The CUB-200 dataset consists of 8,855 training images and 2,933 test images representing 200 species of birds. In MSCOCO, there are 591,753 images utilized for training and 25,014 for testing.'
Hardware Specification | Yes | We trained all models with a batch size of 16 across 8 Tesla A100 GPUs.
Software Dependencies | No | The paper mentions AdamW as the optimizer but does not provide version numbers for software dependencies such as programming languages or libraries.
Experiment Setup | Yes | For the diffusion process, we set the number of diffusion steps to 500, and the noise schedule is linear, where α_t goes from 1 to 0 and γ_t goes from 0 to 1. The denoising network architecture is as described in Sec. 3.2, in which the transformer comprises 20 transformer blocks with 16 attention heads and a feature dimension of 1024; this model contains 600M parameters. For the ablation model, we use 18 transformer blocks with 16 heads and a dimension of 256; this model contains 119M parameters. The optimiser is AdamW (Loshchilov & Hutter, 2018), and the learning rate is 9e-4 without warmup. We trained all models with a batch size of 16 across 8 Tesla A100 GPUs.
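
To make the reported setup concrete, the lines below give a minimal Python/PyTorch sketch of the stated hyperparameters (500 diffusion steps, linear schedule for α_t and γ_t, a 20-block/16-head/1024-dim transformer, AdamW at 9e-4, batch size 16). It is an illustrative reconstruction, not the authors' released code: the TransformerEncoder stand-in and the feed-forward width of 4096 are assumptions and will not reproduce the paper's 600M-parameter architecture.

import torch
import torch.nn as nn

# Linear noise schedule over 500 diffusion steps:
# alpha_t goes from 1 to 0, gamma_t goes from 0 to 1.
T = 500
alpha = torch.linspace(1.0, 0.0, T)
gamma = torch.linspace(0.0, 1.0, T)

# Stand-in for the denoising transformer described in Sec. 3.2:
# 20 blocks, 16 attention heads, feature dimension 1024.
# (The feed-forward width of 4096 is an assumption; the actual
# architecture and its 600M parameter count are not matched here.)
layer = nn.TransformerEncoderLayer(d_model=1024, nhead=16,
                                   dim_feedforward=4096, batch_first=True)
denoiser = nn.TransformerEncoder(layer, num_layers=20)

# AdamW with the reported learning rate of 9e-4 and no warmup.
optimizer = torch.optim.AdamW(denoiser.parameters(), lr=9e-4)

# Reported batch size of 16 (spread across 8 Tesla A100 GPUs in the paper).
batch_size = 16

The schedule tensors simply make explicit how the paper's linear "noise planning" maps step t to the keep probability α_t and the absorb probability γ_t; any distributed-training or warmup logic is omitted since the paper reports none.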