V2Meow: Meowing to the Visual Beat via Video-to-Music Generation

Authors: Kun Su, Judith Yue Li, Qingqing Huang, Dima Kuzmin, Joonseok Lee, Chris Donahue, Fei Sha, Aren Jansen, Yu Wang, Mauro Verzetti, Timo Denk

AAAI 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Trained on 5k hours of music audio clips paired with video frames mined from in-the-wild music videos, V2Meow is competitive with previous domain-specific models when evaluated in a zero-shot manner. Through both qualitative and quantitative evaluations, we demonstrate that our model outperforms various existing music generation systems in terms of visual-audio correspondence and audio quality.
Researcher Affiliation | Collaboration | 1) Google Research, Mountain View, CA, USA; 2) Google DeepMind, Mountain View, CA, USA; 3) University of Washington, Seattle, WA, USA; 4) ByteDance, San Jose, CA, USA; 5) Seoul National University, Seoul, South Korea; 6) Carnegie Mellon University, Pittsburgh, PA, USA; 7) Music and Audio Research Laboratory, New York University, NY, USA
Pseudocode | No | The paper does not contain any pseudocode or algorithm blocks.
Open Source Code | No | The paper states 'Music samples are available at tinyurl.com/v2meow' but does not provide a link to, or any statement about, the availability of source code for the described methodology.
Open Datasets | Yes | Training Datasets. Following (Surís et al. 2022), we filtered a publicly available video dataset (Abu-El-Haija et al. 2016) to 110k videos with the label Music Videos and refer to it as MV100K.
Dataset Splits | Yes | The dataset was split into training and validation sets in an 80:20 ratio. (A hedged preparation sketch follows the table.)
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU model, CPU type, or memory) used for running its experiments, only general mentions such as 'on a GPU'.
Software Dependencies | No | The paper mentions various models and frameworks (e.g., CLIP, I3D, ViT-VQGAN) and general concepts (e.g., encoder-decoder Transformer) but does not provide specific version numbers for software dependencies such as PyTorch, TensorFlow, or Python libraries.
Experiment Setup | Yes | For all visual features, we use a frame rate at 1 fps... we use encoder-decoder Transformer with 12 layers, 16 attention heads, an embedding dimension of 1024, feed-forward layers of dimensionality 4096, and relative positional embeddings. We use 10-second random crops... The coarse to fine acoustic tokens modeling is trained on 3-second crops. During inference, we use temperature sampling for all stages, with temperatures {1.0, 0.95, 0.4} for modeling stages 1, 2, and 3, respectively. (A hedged configuration sketch follows the table.)
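The Open Datasets and Dataset Splits rows above describe filtering a public video corpus to music videos (MV100K) and an 80:20 train/validation split. Below is a minimal sketch of that preparation step, assuming a line-delimited JSON metadata file with "video_id" and "labels" fields; the file format, field names, and function name are illustrative assumptions, not the authors' actual pipeline.

```python
import json
import random

# Label string as quoted in the paper's dataset description.
MUSIC_VIDEO_LABEL = "Music Videos"

def build_mv100k_split(metadata_path: str, train_fraction: float = 0.8, seed: int = 0):
    """Hedged sketch: keep music videos and split them 80:20 into train/validation ids."""
    with open(metadata_path) as f:
        records = [json.loads(line) for line in f]  # one JSON record per video (assumed format)

    # Filter the public corpus down to videos carrying the music-video label.
    music_video_ids = [r["video_id"] for r in records
                       if MUSIC_VIDEO_LABEL in r.get("labels", [])]

    # Deterministic 80:20 split.
    random.Random(seed).shuffle(music_video_ids)
    cut = int(train_fraction * len(music_video_ids))
    return music_video_ids[:cut], music_video_ids[cut:]
```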
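The Experiment Setup row quotes the Transformer hyperparameters and the per-stage sampling temperatures. The sketch below collects those quoted numbers into a config object and shows a standard temperature-sampling step; the class, field, and function names are illustrative assumptions, and torch is used only as a convenient example framework, not a dependency stated by the paper.

```python
from dataclasses import dataclass
import torch

@dataclass
class V2MeowTransformerConfig:
    # Numeric values quoted from the paper's experiment setup; names are illustrative.
    num_layers: int = 12
    num_attention_heads: int = 16
    embed_dim: int = 1024
    ffn_dim: int = 4096
    relative_position_embeddings: bool = True
    visual_frame_rate_fps: int = 1
    random_crop_seconds: int = 10           # 10-second random training crops
    coarse_to_fine_crop_seconds: int = 3    # coarse-to-fine acoustic token modeling

# Inference-time sampling temperatures for modeling stages 1, 2, and 3.
STAGE_TEMPERATURES = {1: 1.0, 2: 0.95, 3: 0.4}

def sample_next_token(logits: torch.Tensor, stage: int) -> torch.Tensor:
    """Draw one token id from a [vocab_size] logits vector using the stage's temperature."""
    probs = torch.softmax(logits / STAGE_TEMPERATURES[stage], dim=-1)
    return torch.multinomial(probs, num_samples=1)
```

A temperature of 1.0 (stage 1) samples from the unmodified distribution, while the lower 0.4 temperature at stage 3 sharpens the distribution toward the most likely fine acoustic tokens.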