Discrete Contrastive Diffusion for Cross-Modal Music and Image Generation

Authors: Ye Zhu, Yu Wu, Kyle Olszewski, Jian Ren, Sergey Tulyakov, Yan Yan

ICLR 2023

Reproducibility Assessment
Each entry below gives the reproducibility variable, the assessed result, and the supporting LLM response.

Research Type: Experimental
LLM Response: "We conduct experiments on three conditional generation tasks: dance-to-music generation, text-to-image synthesis, and class-conditioned image synthesis. ... The quantitative experimental results are shown in Tab. 1 and Tab. 2."

Researcher Affiliation: Collaboration
LLM Response: Ye Zhu (Department of Computer Science, Illinois Institute of Technology, Chicago, IL 60616, USA; yzhu96@hawk.iit.edu); Yu Wu (School of Computer Science, Wuhan University, Wuhan 430000, China; wuyucs@whu.edu.cn); Kyle Olszewski, Jian Ren, and Sergey Tulyakov (Snap Inc., Santa Monica, CA 90405, USA; {kolszewski,jren,stulyakov}@snap.com); Yan Yan (Department of Computer Science, Illinois Institute of Technology, Chicago, IL 60616, USA; yyan34@iit.edu)

Pseudocode: Yes
LLM Response: "Algorithm 1: Conditional Discrete Contrastive Diffusion Training. The referenced equations can be found in the main paper."

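The paper's Algorithm 1 is not reproduced in this report. As a rough illustration only, the sketch below shows one plausible shape of such a training step: a standard conditional denoising loss plus an InfoNCE-style contrastive term that asks the condition to identify the ground-truth sample among its negatives. Every name here (cdcd_step, denoise_loss_fn, score_fn) is hypothetical, and the exact contrastive formulation is an assumption, not the paper's objective.

```python
import torch
import torch.nn.functional as F

def cdcd_step(denoise_loss_fn, score_fn, x0, cond, negatives, lam=5e-5):
    """Hypothetical training step: a conditional denoising loss plus an
    InfoNCE-style contrastive term in which the condition must identify
    the ground-truth sample x0 among N negatives (N = 10 in the paper)."""
    denoise_loss = denoise_loss_fn(x0, cond)

    pos = score_fn(cond, x0).unsqueeze(1)                             # (B, 1)
    neg = torch.stack([score_fn(cond, n) for n in negatives], dim=1)  # (B, N)
    logits = torch.cat([pos, neg], dim=1)                             # (B, 1 + N)
    labels = torch.zeros(x0.size(0), dtype=torch.long, device=x0.device)
    contrastive_loss = F.cross_entropy(logits, labels)  # positive pair is class 0

    return denoise_loss + lam * contrastive_loss

# Toy usage with stand-in components (illustration only):
B, D, N = 4, 16, 10
x0, cond = torch.randn(B, D), torch.randn(B, D)
negatives = [torch.randn(B, D) for _ in range(N)]
loss = cdcd_step(
    denoise_loss_fn=lambda x, c: ((x - c) ** 2).mean(),  # stand-in for the diffusion loss
    score_fn=lambda c, x: (c * x).sum(dim=-1),           # dot-product similarity score
    x0=x0, cond=cond, negatives=negatives,
)
```
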
Open Source Code: No
LLM Response: The paper states, "For qualitative samples of synthesized dance music sequences, please refer to our anonymous page in the supplement with music samples," indicating an anonymous link for review, but there is no explicit statement about releasing the code for the methodology, nor a public repository link.

Open Datasets: Yes
LLM Response: "We use the AIST++ (Li et al., 2021) dataset and the TikTok Dance-Music dataset (Zhu et al., 2022a) for the dance-to-music experiments. ... We conduct text-to-image synthesis on the CUB200 (Wah et al., 2011) and MSCOCO (Lin et al., 2014) datasets. ... We also perform the class-conditioned image generation on ImageNet (Deng et al., 2009; Russakovsky et al., 2015)."

Dataset Splits: Yes
LLM Response: "We adopt the official cross-modality splits without overlapping music songs for both datasets."

Hardware Specification: Yes
LLM Response: "For the dance2music task experiments on the AIST++ dataset, we use 4 NVIDIA RTX A5000 GPUs, and train the model for approximately 2 days. ... For the same experiments on the MSCOCO dataset, we run the experiments on Amazon Web Services (AWS) using 8 NVIDIA Tesla V100 GPUs. ... For the class-conditioned experiments on ImageNet, we use 8 NVIDIA Tesla V100 GPUs running on AWS."

Software Dependencies: No
LLM Response: The paper mentions using the "AdamW (Loshchilov & Hutter, 2017) optimizer" and implicitly relies on deep learning frameworks, but it does not specify version numbers for any programming languages, libraries, or other key software components.

Experiment Setup: Yes
LLM Response: "We set the initial weight for the contrastive loss as λ = 5e-5. The number N of intra- and inter-negative samples for each GT music sample is 10. ... The AdamW (Loshchilov & Hutter, 2017) optimizer with β1 = 0.9 and β2 = 0.96 is deployed in our training, with a learning rate of 4.5e-4. ... We adopt a truncation rate of 0.86 in our inference."

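For concreteness, the reported optimization settings translate directly into a standard PyTorch configuration. A minimal sketch, assuming PyTorch (the paper does not name its framework) and using a placeholder module in place of the actual model:

```python
import torch
import torch.nn as nn

# Placeholder module; the actual conditional discrete diffusion model
# is defined in the paper and not reproduced here.
model = nn.Linear(16, 16)

# Hyperparameters as reported in the paper's experiment setup.
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=4.5e-4,           # reported learning rate
    betas=(0.9, 0.96),   # reported beta1 and beta2
)
contrastive_weight = 5e-5  # initial weight lambda for the contrastive loss
num_negatives = 10         # intra- and inter-negatives per GT music sample
truncation_rate = 0.86     # applied at inference time
```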