Any-to-Any Generation via Composable Diffusion
Authors: Zineng Tang, Ziyi Yang, Chenguang Zhu, Michael Zeng, Mohit Bansal
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate the any-to-any generation capability of CoDi, including single-to-single modality generation, multi-condition generation, and the novel capacity of joint generation of multiple modalities. For example, generating synchronized video and audio given the text input prompt; or generating video given a prompt image and audio. We also provide a quantitative evaluation of CoDi using eight multimodal datasets. |
| Researcher Affiliation | Collaboration | Zineng Tang¹, Ziyi Yang², Chenguang Zhu², Michael Zeng², Mohit Bansal¹ (¹University of North Carolina at Chapel Hill; ²Microsoft Azure Cognitive Services Research) |
| Pseudocode | No | The paper includes architectural diagrams (Figures 2, 6, and 7) and descriptions of the methods, but no explicit pseudocode blocks or algorithms. |
| Open Source Code | Yes | The project page with demonstrations and code is at https://codi-gen.github.io/ [...] We will publicly release our code and checkpoints. |
| Open Datasets | Yes | We list training tasks of CoDi in Table 1, including single modality synthesis, joint multimodal generation, and contrastive learning to align prompt encoders. Table 1 provides an overview of the datasets, tasks, number of samples, and domain. [...] LAION-400M: Creative Commons CC-BY 4.0; AudioSet: Creative Commons CC-BY 4.0; AudioCaps: MIT; Freesound: Creative Commons; BBC Sound Effects: The BBC's Content Licence; SoundNet: MIT; WebVid-10M: WebVid; HD-VILLA-100M: Research Use of Data Agreement v1.0 |
| Dataset Splits | Yes | We test on the validation set of AudioCaps [24] since all four modalities are present in this dataset. [...] The benchmark is the validation set of AudioCaps [24]. |
| Hardware Specification | No | The paper does not specify the hardware used for training or inference, such as specific GPU or CPU models. |
| Software Dependencies | No | The paper lists general software libraries like “PyTorch”, “Huggingface Transformers”, “Torchvision”, and “Torchaudio” along with their licenses, but does not provide specific version numbers for any of them. |
| Experiment Setup | Yes | Table 12: Hyperparameters for our diffusion models. [...] Diffusion Setup Diffusion steps 1000 [...] Learning rate 2e-5. [...] Section B Model Training [...] We use Adam [26] optimizer with learning rate 1e-4 and weight decay 1e-4. [...] We adopt curriculum learning on frame resolution and frames-per-second (FPS). First, the diffuser is trained on the WebVid dataset of a 256-frame resolution, with the training objective being text-conditioned video generation. The training clips are sampled from 2-second video chunks with 4 FPS. Second, the model is further trained on HD-VILLA and ACAV datasets, with a 512-frame resolution and 8 FPS, and the training objective is image-conditioned video generation (the image is a randomly sampled frame of the clip). Each training clip contains 16 frames sampled from a 2-second video chunk with 8 FPS. |
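
To make the quoted training details in the Experiment Setup row concrete, the sketch below shows one way the reported optimizer settings (Adam, learning rate 1e-4, weight decay 1e-4) and the two-stage resolution/FPS curriculum could be expressed in PyTorch. The placeholder module, the stage dictionaries, and the 8-frame count for stage one (inferred from 2-second chunks at 4 FPS) are illustrative assumptions, not the authors' released code.

```python
import torch
from torch.optim import Adam

# Placeholder module standing in for the video diffuser; purely illustrative,
# since the paper's model code is not reproduced here.
video_diffuser = torch.nn.Linear(8, 8)

# Optimizer settings as quoted above (Section B): Adam, lr 1e-4, weight decay 1e-4.
optimizer = Adam(video_diffuser.parameters(), lr=1e-4, weight_decay=1e-4)

# Two-stage curriculum on frame resolution and FPS, paraphrased from the quoted passage.
# Stage 1 frame count (8) is an inference from "2-second video chunks with 4 FPS".
curriculum = [
    {"datasets": ["WebVid-10M"], "resolution": 256, "fps": 4,
     "frames_per_clip": 8, "conditioning": "text"},
    {"datasets": ["HD-VILLA-100M", "ACAV"], "resolution": 512, "fps": 8,
     "frames_per_clip": 16, "conditioning": "image"},
]

for stage in curriculum:
    # A real run would build a dataloader for stage["datasets"] at the given
    # resolution/FPS and optimize the video diffusion objective; omitted here.
    print(stage)
```

Note that the per-modality diffusion hyperparameters quoted from Table 12 (e.g., 1000 diffusion steps, learning rate 2e-5) are reported separately in the paper and are not reflected in this sketch.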
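The Research Type row quotes the paper's any-to-any claims: single-to-single generation, multi-condition generation, and joint generation of multiple modalities (e.g., text to synchronized video plus audio, or image plus audio to video). The snippet below is a purely hypothetical interface sketch of those input/output combinations; `GenerationRequest`, its fields, and the example prompts are invented for illustration and do not correspond to the released CoDi API.

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical request object illustrating the condition/output combinations the
# paper describes; the actual CoDi interface may look nothing like this.
@dataclass
class GenerationRequest:
    text: Optional[str] = None    # text prompt condition
    image: Optional[str] = None   # path to a prompt image
    audio: Optional[str] = None   # path to a prompt audio clip
    outputs: list = field(default_factory=list)  # requested modalities, e.g. ["video", "audio"]

# The three settings named in the Research Type row:
requests = [
    # text -> synchronized video + audio (joint multimodal generation)
    GenerationRequest(text="fireworks exploding over a city at night",
                      outputs=["video", "audio"]),
    # image + audio -> video (multi-condition generation)
    GenerationRequest(image="prompt.png", audio="prompt.wav", outputs=["video"]),
    # single-to-single modality generation, e.g. text -> image
    GenerationRequest(text="a watercolor painting of a fox", outputs=["image"]),
]

for r in requests:
    print(r)
```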