Discrete Contrastive Diffusion for Cross-Modal Music and Image Generation
Authors: Ye Zhu, Yu Wu, Kyle Olszewski, Jian Ren, Sergey Tulyakov, Yan Yan
ICLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct experiments on three conditional generation tasks: dance-to-music generation, text-to-image synthesis, and class-conditioned image synthesis. ... The quantitative experimental results are shown in Tab. 1 and Tab. 2. |
| Researcher Affiliation | Collaboration | Ye Zhu, Department of Computer Science, Illinois Institute of Technology, Chicago, IL 60616, USA (yzhu96@hawk.iit.edu); Yu Wu, School of Computer Science, Wuhan University, Wuhan 430000, China (wuyucs@whu.edu.cn); Kyle Olszewski, Jian Ren, Sergey Tulyakov, Snap Inc., Santa Monica, CA 90405, USA ({kolszewski,jren,stulyakov}@snap.com); Yan Yan, Department of Computer Science, Illinois Institute of Technology, Chicago, IL 60616, USA (yyan34@iit.edu) |
| Pseudocode | Yes | Algorithm 1: Conditional Discrete Contrastive Diffusion Training. The referenced equations can be found in the main paper. (A hedged sketch of the contrastive term in such a training step appears below the table.) |
| Open Source Code | No | The paper states 'For qualitative samples of synthesized dance music sequences, please refer to our anonymous page in the supplement with music samples,' which points to an anonymous page for review, but there is no explicit statement about releasing the code or a link to a public repository. |
| Open Datasets | Yes | We use the AIST++ Li et al. (2021) dataset and the TikTok Dance-Music dataset Zhu et al. (2022a) for the dance-to-music experiments. ... We conduct text-to-image synthesis on CUB200 Wah et al. (2011) and MSCOCO datasets Lin et al. (2014). ... We also perform the class-conditioned image generation on ImageNet Deng et al. (2009); Russakovsky et al. (2015). |
| Dataset Splits | Yes | We adopt the official cross-modality splits without overlapping music songs for both datasets. |
| Hardware Specification | Yes | For the dance2music task experiments on the AIST++ dataset, we use 4 NVIDIA RTX A5000 GPUs, and train the model for approximately 2 days. ... For the same experiments on the MSCOCO dataset, we run the experiments on Amazon Web Services (AWS) using 8 NVIDIA Tesla V100 GPUs. ... For the class-conditioned experiments on the ImageNet, we use 8 NVIDIA Tesla V100 GPUs running on AWS. |
| Software Dependencies | No | The paper mentions using the 'AdamW Loshchilov & Hutter (2017) optimizer' and implicitly relies on deep learning frameworks, but it does not specify version numbers for any programming languages, libraries, or other key software components. |
| Experiment Setup | Yes | We set the initial weight for the contrastive loss as λ = 5e-5. The number N of intra- and inter-negative samples for each GT music sample is 10. ... The AdamW Loshchilov & Hutter (2017) optimizer with β1 = 0.9 and β2 = 0.96 is deployed in our training, with a learning rate of 4.5e-4. ... We adopt a truncation rate of 0.86 in our inference. (These reported settings are wired up in the configuration sketch below the table.) |
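
To make the Pseudocode row concrete, here is a minimal sketch of the kind of contrastive term that a conditional discrete contrastive diffusion training step combines with the diffusion loss. The function name, the InfoNCE-style formulation, the temperature value, and the embedding shapes are illustrative assumptions, not the authors' code (which is not released); only the choice of N = 10 negatives is taken from the paper.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(cond_emb, pos_emb, neg_embs, temperature=0.1):
    """InfoNCE-style contrastive term (assumed formulation): pull the
    condition embedding toward its ground-truth (GT) sample and push it
    away from N intra-/inter-negative samples.

    cond_emb: (B, D) condition features (e.g. dance motion)
    pos_emb:  (B, D) features of the GT sample (e.g. its music)
    neg_embs: (B, N, D) features of N negative samples per GT sample
    """
    pos_logit = (cond_emb * pos_emb).sum(dim=-1, keepdim=True)      # (B, 1)
    neg_logits = torch.einsum("bd,bnd->bn", cond_emb, neg_embs)     # (B, N)
    logits = torch.cat([pos_logit, neg_logits], dim=1) / temperature
    # The positive pair always sits at index 0 of the logits.
    target = torch.zeros(logits.size(0), dtype=torch.long, device=logits.device)
    return F.cross_entropy(logits, target)

# Toy usage: batch of 4, feature dim 16, N = 10 negatives as in the paper.
cond, pos, negs = torch.randn(4, 16), torch.randn(4, 16), torch.randn(4, 10, 16)
print(contrastive_loss(cond, pos, negs))
```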
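The Experiment Setup row pins down the reported optimizer and loss-weighting hyperparameters; the snippet below shows how they would be wired up in PyTorch. The placeholder model and the way the two loss terms are combined are assumptions for illustration; the numeric values are the ones quoted above.

```python
import torch

# Placeholder network; the actual architecture is described in the paper.
model = torch.nn.Linear(256, 256)

# Reported settings: AdamW with beta1 = 0.9, beta2 = 0.96, lr = 4.5e-4.
optimizer = torch.optim.AdamW(model.parameters(), lr=4.5e-4, betas=(0.9, 0.96))

LAMBDA = 5e-5   # initial weight for the contrastive loss
N_NEG = 10      # intra- and inter-negative samples per GT music sample

# Assumed combination of the two training objectives:
#   total_loss = diffusion_loss + LAMBDA * contrastive_loss
```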