V2Meow: Meowing to the Visual Beat via Video-to-Music Generation
Authors: Kun Su, Judith Yue Li, Qingqing Huang, Dima Kuzmin, Joonseok Lee, Chris Donahue, Fei Sha, Aren Jansen, Yu Wang, Mauro Verzetti, Timo Denk
AAAI 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Trained on 5k hours of music audio clips paired with video frames mined from in-the-wild music videos, V2Meow is competitive with previous domain-specific models when evaluated in a zero-shot manner. Through both qualitative and quantitative evaluations, we demonstrate that our model outperforms various existing music generation systems in terms of visual-audio correspondence and audio quality. |
| Researcher Affiliation | Collaboration | 1 Google Research, Mountain View, CA, USA; 2 Google DeepMind, Mountain View, CA, USA; 3 University of Washington, Seattle, WA, USA; 4 ByteDance, San Jose, CA, USA; 5 Seoul National University, Seoul, South Korea; 6 Carnegie Mellon University, Pittsburgh, PA, USA; 7 Music and Audio Research Laboratory, New York University, NY, USA |
| Pseudocode | No | The paper does not contain any pseudocode or algorithm blocks. |
| Open Source Code | No | The paper states 'Music samples are available at tinyurl.com/v2meow' but does not provide a link to, or any statement about, the availability of source code for the described method. |
| Open Datasets | Yes | Training Datasets. Following (Surís et al. 2022), we filtered a publicly available video dataset (Abu-El-Haija et al. 2016) to 110k videos with the label Music Videos and refer to it as MV100K. (A hedged sketch of this filtering, together with the 80:20 split reported below, follows the table.) |
| Dataset Splits | Yes | The training and validation datasets were split in an 80:20 ratio. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU model, CPU type, or memory) used for running its experiments, only general mentions like 'on a GPU'. |
| Software Dependencies | No | The paper mentions various models and frameworks (e.g., CLIP, I3D, ViT-VQGAN) and general concepts (e.g., encoder-decoder Transformer) but does not provide specific version numbers for software dependencies like PyTorch, TensorFlow, or Python libraries. |
| Experiment Setup | Yes | For all visual features, we use a frame rate of 1 fps... we use an encoder-decoder Transformer with 12 layers, 16 attention heads, an embedding dimension of 1024, feed-forward layers of dimensionality 4096, and relative positional embeddings. We use 10-second random crops... The coarse-to-fine acoustic token modeling is trained on 3-second crops. During inference, we use temperature sampling for all stages, with temperatures {1.0, 0.95, 0.4} for modeling stages 1, 2, and 3, respectively. (A hypothetical configuration sketch follows the table.) |
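
The dataset preparation reported in the Open Datasets and Dataset Splits rows (label-based filtering of a public video corpus, then an 80:20 train/validation split) can be sketched as follows. This is a minimal illustration, not the authors' pipeline; the metadata schema, field names, and the `build_mv_corpus` helper are assumptions.

```python
import random

def build_mv_corpus(metadata, seed=0):
    """Hypothetical sketch: filter video metadata to entries labelled
    "Music Videos", then split 80:20 into train/validation.

    metadata: iterable of dicts, e.g. {"video_id": ..., "labels": [...]}.
    """
    music_videos = [m for m in metadata if "Music Videos" in m.get("labels", [])]

    # Deterministic shuffle before splitting.
    rng = random.Random(seed)
    rng.shuffle(music_videos)

    split = int(0.8 * len(music_videos))  # 80:20 train/validation split
    return music_videos[:split], music_videos[split:]
```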
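
The Experiment Setup row amounts to a set of reported hyperparameters. Below is a minimal sketch that groups them for reference, assuming a simple dataclass; `V2MeowConfig`, `stage_temperatures`, and `sample_temperature` are illustrative names, not taken from the paper or any released code.

```python
from dataclasses import dataclass

@dataclass
class V2MeowConfig:
    # Visual features are extracted at 1 frame per second.
    visual_frame_rate_fps: int = 1
    # Encoder-decoder Transformer as reported in the paper.
    num_layers: int = 12
    num_heads: int = 16
    embedding_dim: int = 1024
    feedforward_dim: int = 4096
    relative_position_embeddings: bool = True
    # Training crop lengths (seconds): 10 s random crops for stage 1,
    # 3 s crops for the coarse-to-fine acoustic token stages.
    semantic_crop_seconds: int = 10
    acoustic_crop_seconds: int = 3
    # Inference-time sampling temperatures for modeling stages 1, 2, 3.
    stage_temperatures: tuple = (1.0, 0.95, 0.4)

def sample_temperature(stage: int, cfg: V2MeowConfig) -> float:
    """Return the inference sampling temperature for a 1-indexed stage."""
    return cfg.stage_temperatures[stage - 1]
```

Collecting the reported values this way makes them easy to check against the paper's text; it implies nothing about how the original training code was organized.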