Unified Discrete Diffusion for Simultaneous Vision-Language Generation

Authors: Minghui Hu, Chuanxia Zheng, Zuopeng Yang, Tat-Jen Cham, Heliang Zheng, Chaoyue Wang, Dacheng Tao, Ponnuthurai N. Suganthan

ICLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments indicate that our proposed method can perform comparably to the state-of-the-art solutions in various generation tasks.
Researcher Affiliation | Collaboration | Nanyang Technological University; University of Oxford; Qatar University; JD Explore Academy; Shanghai Jiao Tong University. Contact: {e200008, ASTJCham}@ntu.edu.sg, cxzheng@robots.ox.ac.uk, p.n.suganthan@qu.edu.qa, yzpeng@sjtu.edu.cn, zhengheliang@jd.com, chaoyue.wang@outlook.com, dacheng.tao@gmail.com
Pseudocode | No | The paper describes the method and mathematical formulations but does not include any pseudocode or clearly labeled algorithm blocks.
Open Source Code | Yes | The inference code is released at https://github.com/mhh0318/UniD3.
Open Datasets | Yes | We demonstrate the feasibility of the proposed method on two commonly used datasets: CUB200 (Wah et al., 2011) and MSCOCO (Lin et al., 2014).
Dataset Splits | No | The paper specifies training and test image counts for the CUB-200 and MSCOCO datasets but does not explicitly mention a separate validation split: 'The CUB-200 dataset consists of 8,855 training images and 2,933 test images representing 200 species of birds. In MSCOCO, there are 591,753 images utilized for training and 25,014 for testing.'
Hardware Specification | Yes | We trained all models with a batch size of 16 across 8 Tesla A100 GPUs.
Software Dependencies | No | The paper mentions AdamW as the optimizer but does not provide version numbers for software dependencies such as programming languages or libraries.
Experiment Setup | Yes | For the diffusion process, we set the number of diffusion steps to 500, and the noise schedule is linear, where α_t goes from 1 to 0 and γ_t goes from 0 to 1. The denoising network architecture is as described in Sec. 3.2, in which the transformer comprises 20 transformer blocks with 16 attention heads and a feature dimension of 1024; this model contains 600M parameters. For the ablation model, we use 18 transformer blocks with 16 heads and a dimension of 256; this model contains 119M parameters. The optimiser is AdamW (Loshchilov & Hutter, 2018), and the learning rate is 9e-4 without warmup. We trained all models with a batch size of 16 across 8 Tesla A100 GPUs.
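
To make the reported setup concrete, the lines below give a minimal Python/PyTorch sketch of the stated hyperparameters (500 diffusion steps, linear schedule for α_t and γ_t, a 20-block/16-head/1024-dim transformer, AdamW at 9e-4, batch size 16). It is an illustrative reconstruction, not the authors' released code: the TransformerEncoder stand-in and the feed-forward width of 4096 are assumptions and will not reproduce the paper's 600M-parameter architecture.

import torch
import torch.nn as nn

# Linear noise schedule over 500 diffusion steps:
# alpha_t goes from 1 to 0, gamma_t goes from 0 to 1.
T = 500
alpha = torch.linspace(1.0, 0.0, T)
gamma = torch.linspace(0.0, 1.0, T)

# Stand-in for the denoising transformer described in Sec. 3.2:
# 20 blocks, 16 attention heads, feature dimension 1024.
# (The feed-forward width of 4096 is an assumption; the actual
# architecture and its 600M parameter count are not matched here.)
layer = nn.TransformerEncoderLayer(d_model=1024, nhead=16,
                                   dim_feedforward=4096, batch_first=True)
denoiser = nn.TransformerEncoder(layer, num_layers=20)

# AdamW with the reported learning rate of 9e-4 and no warmup.
optimizer = torch.optim.AdamW(denoiser.parameters(), lr=9e-4)

# Reported batch size of 16 (spread across 8 Tesla A100 GPUs in the paper).
batch_size = 16

The schedule tensors simply make explicit how the paper's linear "noise planning" maps step t to the keep probability α_t and the absorb probability γ_t; any distributed-training or warmup logic is omitted since the paper reports none.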