How Does it Sound?

Authors: Kun Su, Xiulong Liu, Eli Shlizerman

NeurIPS 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate RhythmicNet on large-scale video datasets that include body movements with inherent sound association, such as dance, as well as in-the-wild internet videos of various movements and actions. We show that the method can generate plausible music that aligns with different types of human movements. (Section 4, Experiments & Results)
Researcher Affiliation | Academia | Department of Electrical & Computer Engineering, University of Washington, Seattle, USA; Department of Applied Mathematics, University of Washington, Seattle, USA. Corresponding author: shlizee@uw.edu
Pseudocode | No | The paper describes the computational steps and models in text but does not include any formally structured pseudocode or algorithm blocks.
Open Source Code | Yes | Code: System setup and code are available in a GitHub repository. https://github.com/shlizee/RhythmicNet
Open Datasets | Yes | We use the AIST Dance Video Database, a large-scale collection of dance videos in 60fps, for training and testing of Video2Rhythm [69]. For Rhythm2Drum, we use the Groove MIDI dataset [50], which contains 1150 MIDI files and over 22,000 measures of drumming. For Drum2Music, we extract two subsets of the Lakh MIDI dataset [70] to separately train Drum2Piano and Drum2Guitar models.
Dataset Splits | Yes | We split the samples into train/validate/test sets by 0.8/0.1/0.1 based on the dance genres, dancers, and camera ids. We split the data into 0.8/0.1/0.1 train/validate/test sets. This results in 34991/1944/1944 segments for train/validate/test sets respectively. For Drum2Guitar, we perform a similar selection to obtain 12904/717/717 segments for train/validate/test sets respectively. (A hedged split sketch follows the table.)
Hardware Specification | Yes | We use PyTorch [71] to implement all models in RhythmicNet with two Titan X GPUs.
Software Dependencies | No | The paper states 'We use PyTorch [71] to implement all models' but does not provide specific version numbers for PyTorch or for other software dependencies it mentions, such as the OpenPose framework, U-Net, or Transformer-XL.
Experiment Setup | Yes | In Video2Rhythm, the network contains a 10-layer ST-GCN and a 2-layer transformer encoder with 2-head attention. ... In Drum2Music, the model consists of a recurrent transformer encoder and a recurrent transformer decoder. We set the number of encoder layers, decoder layers, encoder heads and decoder heads to 4, 8, 8, and 8 respectively. The length of the training input tokens and the length of the memory is 256. We provide additional configuration details in the supplementary materials. (A hedged configuration sketch follows the table.)
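
The 0.8/0.1/0.1 split quoted in the Dataset Splits row can be illustrated with a minimal sketch. This is not the authors' released code: the segment metadata fields (genre, dancer_id, camera_id) and the grouping strategy are assumptions, chosen only to reflect the paper's statement that the split is based on dance genres, dancers, and camera ids.

```python
# Hedged sketch of a 0.8/0.1/0.1 train/validate/test split grouped by
# (genre, dancer, camera). Field names and grouping are illustrative
# assumptions, not the paper's released implementation.
import random
from collections import defaultdict

def split_segments(segments, ratios=(0.8, 0.1, 0.1), seed=0):
    """Split segments so each (genre, dancer, camera) group stays in one partition."""
    groups = defaultdict(list)
    for seg in segments:
        groups[(seg["genre"], seg["dancer_id"], seg["camera_id"])].append(seg)

    keys = sorted(groups)
    random.Random(seed).shuffle(keys)

    n_train = int(ratios[0] * len(keys))
    n_val = int(ratios[1] * len(keys))

    train = [s for k in keys[:n_train] for s in groups[k]]
    val = [s for k in keys[n_train:n_train + n_val] for s in groups[k]]
    test = [s for k in keys[n_train + n_val:] for s in groups[k]]
    return train, val, test
```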
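The hyperparameters quoted in the Experiment Setup row can be collected into a small configuration sketch. Only the layer, head, and length counts come from the paper; the ST-GCN backbone and the recurrent (Transformer-XL-style) Drum2Music model are not re-implemented here, and d_model is an assumed placeholder. The plain nn.TransformerEncoder below merely stands in for the 2-layer, 2-head encoder in Video2Rhythm.

```python
# Hedged configuration sketch for the reported RhythmicNet hyperparameters.
# Values come from the Experiment Setup row; d_model is an assumption.
import torch.nn as nn

VIDEO2RHYTHM = {
    "st_gcn_layers": 10,   # 10-layer ST-GCN over body keypoints
    "encoder_layers": 2,   # 2-layer transformer encoder
    "encoder_heads": 2,    # 2-head attention
}

DRUM2MUSIC = {
    "encoder_layers": 4,
    "decoder_layers": 8,
    "encoder_heads": 8,
    "decoder_heads": 8,
    "train_seq_len": 256,  # length of training input tokens
    "memory_len": 256,     # recurrence memory length
}

def build_rhythm_encoder(d_model=128):
    """Stand-in for the Video2Rhythm transformer encoder (d_model assumed)."""
    layer = nn.TransformerEncoderLayer(
        d_model=d_model,
        nhead=VIDEO2RHYTHM["encoder_heads"],
        batch_first=True,
    )
    return nn.TransformerEncoder(layer, num_layers=VIDEO2RHYTHM["encoder_layers"])
```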