Long-Term Rhythmic Video Soundtracker

Authors: Jiashuo Yu, Yaohui Wang, Xinyuan Chen, Xiao Sun, Yu Qiao

ICML 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments show that our model generates long-term soundtracks with state-of-the-art musical quality and rhythmic correspondence.
Researcher Affiliation | Academia | Shanghai Artificial Intelligence Laboratory.
Pseudocode | No | The paper contains architectural diagrams and mathematical formulations but does not include structured pseudocode or algorithm blocks.
Open Source Code | Yes | Codes are available at https://github.com/OpenGVLab/LORIS.
Open Datasets | Yes | To this end, we curate the LORIS dataset based on existing datasets, which involves 86.43h paired videos varying from dances to multiple sports events. The comparison of our dataset with existing datasets is listed in Table 1.
Dataset Splits | Yes | Finally, we randomly split the dataset with a 90%/5%/5% proportion.
Hardware Specification | Yes | We use 8 NVIDIA A100 GPUs to train our model.
Software Dependencies | Yes | We use audio-diffusion-pytorch-v0.0.43 (Schneider, 2023) as our basic backbone.
Experiment Setup | Yes | The dimension of all hidden layers is set to 1024, and the embedding size of genre labels and RGB features is also 1024. For visual rhythm extraction, the bin number K is set to 10, and the hyperparameters of the peak-picking strategy are pre_max = 3, pre_avg = 3, post_max = 3, and post_avg = 3. We set the threshold offset δ to 0.2 times the current local maximum, and the peak wait number ω = 1. The audio sampling rate is set to 22050 Hz. We use AdamW (Loshchilov & Hutter, 2017) as the optimizer with β1 = 0.9, β2 = 0.96, and a weight decay of 4.5e-2. A two-stage learning-rate strategy is applied during training: for layers in the unconditional diffusion model pre-trained by Schneider (2023), we set the learning rate to 3e-6, while the initial learning rate of all other layers is 3e-3. We set a warm-up learning rate of 2e-4 for all layers in the first 1,000 training iterations. We also apply gradient clipping with a max norm of 0.5. The entire LORIS framework is optimized jointly, and we use 8 NVIDIA A100 GPUs to train our model for 100 epochs on the dancing subset, 200 epochs on the floor exercise subset, and 250 epochs on the figure skating subset. For music sampling, we employ classifier-free guidance (Ho & Salimans, 2022) to perform conditional generation with guidance scale w = 20. The number of diffusion steps during inference is set to 50 as a trade-off between music quality and inference speed.
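The 90%/5%/5% split reported in the Dataset Splits row could be reproduced with a simple random partition. The sketch below is a minimal, hypothetical version; the seed and the way samples are listed are assumptions on our part, not details from the paper.

import random

def split_dataset(samples, seed=42):
    """Randomly partition samples into 90% train / 5% val / 5% test."""
    rng = random.Random(seed)  # fixed seed is an assumption, not from the paper
    shuffled = samples[:]
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_train = int(n * 0.90)
    n_val = int(n * 0.05)
    return (shuffled[:n_train],                 # train
            shuffled[n_train:n_train + n_val],  # validation
            shuffled[n_train + n_val:])         # test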
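The peak-picking hyperparameters in the Experiment Setup row (pre_max, pre_avg, post_max, post_avg, δ, ω) match the signature of librosa.util.peak_pick, which suggests a sketch like the one below for extracting visual beats from a motion saliency curve. Mapping δ (0.2 times the current local maximum) onto librosa's fixed delta offset, and the motion_envelope input itself, are our assumptions; the paper may implement its own thresholding.

import numpy as np
import librosa

# Hypothetical 1-D motion saliency curve extracted from video frames.
motion_envelope = np.random.rand(512).astype(np.float32)

peaks = librosa.util.peak_pick(
    motion_envelope,
    pre_max=3, post_max=3,   # frames checked for the local-maximum test
    pre_avg=3, post_avg=3,   # frames used for the moving-average threshold
    delta=0.2 * motion_envelope.max(),  # threshold offset (our approximation of δ)
    wait=1,                  # ω: minimum gap between consecutive peaks
)
print(peaks)  # frame indices of detected visual beats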
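The optimizer settings (AdamW with β1 = 0.9, β2 = 0.96, weight decay 4.5e-2, two-stage learning rates, a flat 2e-4 warm-up for the first 1,000 iterations, and gradient clipping at max norm 0.5) translate naturally to PyTorch parameter groups. The sketch below uses a hypothetical toy model with a pretrained_diffusion block and new_layers to stand in for the real parameter grouping, which depends on the LORIS codebase.

import torch
import torch.nn as nn

class ToyModel(nn.Module):
    """Stand-in for LORIS: one 'pre-trained' block plus new layers."""
    def __init__(self):
        super().__init__()
        self.pretrained_diffusion = nn.Linear(16, 16)  # lr 3e-6 group
        self.new_layers = nn.Linear(16, 1)             # lr 3e-3 group

    def forward(self, x):
        return self.new_layers(self.pretrained_diffusion(x)).pow(2).mean()

model = ToyModel()
optimizer = torch.optim.AdamW(
    [
        {"params": model.pretrained_diffusion.parameters(), "lr": 3e-6},
        {"params": model.new_layers.parameters(), "lr": 3e-3},
    ],
    betas=(0.9, 0.96),
    weight_decay=4.5e-2,
)

base_lrs = [g["lr"] for g in optimizer.param_groups]
warmup_iters = 1000
for step in range(1200):
    # Warm-up: flat 2e-4 for all groups during the first 1,000 iterations,
    # then the per-group base rates (3e-6 / 3e-3) afterwards.
    for group, base_lr in zip(optimizer.param_groups, base_lrs):
        group["lr"] = 2e-4 if step < warmup_iters else base_lr

    loss = model(torch.randn(8, 16))
    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=0.5)
    optimizer.step()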
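Classifier-free guidance with scale w = 20 blends conditional and unconditional noise predictions at each of the 50 inference steps. Below is a minimal sketch of that blending rule; eps_model and the condition handling are hypothetical stand-ins for the paper's diffusion backbone.

import torch

def guided_eps(eps_model, x_t, t, cond, w=20.0):
    """Classifier-free guidance (Ho & Salimans, 2022): blend the
    conditional and unconditional noise predictions."""
    eps_cond = eps_model(x_t, t, cond)    # condition: genre label + visual features
    eps_uncond = eps_model(x_t, t, None)  # null condition
    return eps_uncond + w * (eps_cond - eps_uncond)

# Toy usage with a dummy predictor; the real sampler would call this at
# each of the 50 diffusion steps.
eps_model = lambda x, t, c: torch.ones_like(x) * (0.2 if c is not None else 0.1)
x_t = torch.zeros(2, 4)
print(guided_eps(eps_model, x_t, t=0, cond="dance"))  # 0.1 + 20 * (0.2 - 0.1) = 2.1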