Images that Sound: Composing Images and Sounds on a Single Canvas
Authors: Ziyang Chen, Daniel Geng, Andrew Owens
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate our methods using quantitative metrics and human studies. We also present qualitative comparisons and an analysis of our method and why it works. |
| Researcher Affiliation | Academia | Ziyang Chen, Daniel Geng, Andrew Owens (University of Michigan) |
| Pseudocode | Yes | Algorithm 1: Pseudocode in a PyTorch-like style for the imprint baseline (see the hedged sketch after this table). |
| Open Source Code | Yes | We provide all code in our supplemental material, including code for baselines. We will release all code on acceptance. |
| Open Datasets | Yes | We randomly select 5 discrete (onset-based) and 5 continuous sound category names from VGGSound Common [17] as audio prompts. We randomly chose 5 object and 5 scene classes for image prompts, formatted as "a painting of [class], grayscale". For the image model, we use Stable Diffusion v1.5 [96]. For the audio model, we use Auffusion [118]. |
| Dataset Splits | No | The paper describes generating samples based on prompts and evaluating them, but does not specify explicit training, validation, or test dataset splits for its own experimental process, as it leverages pre-trained models. |
| Hardware Specification | Yes | Our method is significantly faster, generating one sample in 10 seconds compared to the SDS baseline's 2-hour optimization time, using NVIDIA L40S GPUs. |
| Software Dependencies | No | The paper mentions specific models like Stable Diffusion v1.5 and DeepFloyd IF, and algorithms like HiFi-GAN and Griffin-Lim, but does not provide version numbers for general software dependencies (e.g., Python, PyTorch) or for the HiFi-GAN and Griffin-Lim implementations used. |
| Experiment Setup | Yes | We begin the reverse process with random latent noise $z_T \in \mathbb{R}^{4 \times 32 \times 128}$, the same shape that Auffusion was trained on. ... We set the classifier guidance scales $\gamma_v$ and $\gamma_a$ to be between 7.5 and 10 and denoise the latents for 100 inference steps with warm-start parameters of $t_a = 1.0$, $t_v = 0.9$ to preserve audio priors. We decode the latent variables into images of dimension $3 \times 256 \times 1024$ (see the sketch of this reverse process after this table). |
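
The Pseudocode row above refers to the paper's Algorithm 1 for the imprint baseline, which composites an independently generated image onto an independently generated spectrogram. As a rough illustration only, and not the paper's exact algorithm, the sketch below shows one way such an imprint could work, assuming the baseline attenuates spectrogram magnitudes where the image is dark; the function name and the `strength` parameter are hypothetical.

```python
import torch

def imprint(spec: torch.Tensor, img: torch.Tensor, strength: float = 1.0) -> torch.Tensor:
    # Hypothetical imprint operation: `spec` is a magnitude spectrogram
    # and `img` a grayscale image in [0, 1], assumed to share the same
    # (H, W) shape. Dark pixels attenuate the corresponding
    # time-frequency bins, so the image shows up when the audio is
    # visualized as a spectrogram.
    mask = 1.0 - strength * (1.0 - img)
    return spec * mask
```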
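
The Experiment Setup row describes a joint reverse process in which an image diffusion model and an audio diffusion model denoise a shared latent. A minimal sketch under stated assumptions follows: `eps_audio` and `eps_image` are hypothetical callables returning classifier-free-guided noise estimates from Auffusion and Stable Diffusion on the same latent, the two estimates are assumed to be simply averaged once both models are active, and the scheduler is a diffusers-style object. Only the latent shape, guidance scales, step count, and warm-start thresholds come from the row above.

```python
import torch

@torch.no_grad()
def joint_reverse_process(z_T, scheduler, eps_audio, eps_image,
                          gamma_a=7.5, gamma_v=7.5, t_a=1.0, t_v=0.9,
                          steps=100):
    # z_T: random latent of shape (4, 32, 128), as in the setup row.
    scheduler.set_timesteps(steps)
    z = z_T
    for i, t in enumerate(scheduler.timesteps):
        noise_level = 1.0 - i / steps      # fraction of diffusion remaining
        eps_a = eps_audio(z, t, gamma_a)   # with t_a = 1.0, audio model is active at every step
        if noise_level <= t_v:             # image model warm-starts once noise drops below t_v = 0.9
            eps_v = eps_image(z, t, gamma_v)
            eps = 0.5 * (eps_a + eps_v)    # assumed: simple average of the two noise estimates
        else:
            eps = eps_a                    # early steps use only the audio model, preserving audio priors
        z = scheduler.step(eps, t, z).prev_sample  # diffusers-style scheduler update
    return z
```

Decoding the final latent with the image VAE yields the $3 \times 256 \times 1024$ image, while rendering it through the audio pipeline's vocoder (e.g., HiFi-GAN) yields the waveform.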