Modality-Independent Teachers Meet Weakly-Supervised Audio-Visual Event Parser

Authors: Yung-Hsuan Lai, Yen-Chun Chen, Yu-Chiang Frank Wang

NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirical studies show that the harvested labels significantly improve an attentional baseline by 8.0 in average F-score (Type@AV).
Researcher Affiliation | Collaboration | Yung-Hsuan Lai (National Taiwan University), Yen-Chun Chen (Microsoft), Yu-Chiang Frank Wang (National Taiwan University, NVIDIA)
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | Yes | Code is available at: https://github.com/Franklin905/VALOR.
Open Datasets | Yes | Tian et al. [73] proposed the Audio-Visual Video Parsing (AVVP) task, which aims to recognize events independently for the audio and visual modalities and to temporally localize these events. ... Tian et al. [73] created this dataset (Look, Listen, and Parse; LLP) in a weakly-supervised setting.
Dataset Splits | Yes | The dataset is divided into training, validation, and testing splits, containing 10,000, 649, and 1,200 clips, respectively.
Hardware Specification | No | The paper only states in general terms that 'We thank National Center for High-performance Computing (NCHC) for providing computational and storage resources,' without specifying particular hardware components such as CPU or GPU models.
Software Dependencies | No | The paper mentions software components and models such as 'ResNet-152', 'VGGish', 'CLIP', and 'CLAP', but does not provide specific version numbers for these or for other ancillary software dependencies required for replication.
Experiment Setup | Yes | The models are trained using the AdamW optimizer, configured with β1 = 0.5, β2 = 0.999, and weight decay set to 0.001. We employ a learning rate scheduling approach that begins with a linear warm-up phase over 10 epochs, rises to the peak learning rate, and then decays according to a cosine annealing schedule to the minimum learning rate. We set the batch size to 64 and train for 60 epochs in total. We clip the gradient norm at 1.0 during training.
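
To make the weakly-supervised setting quoted in the 'Open Datasets' row concrete, the sketch below illustrates the AVVP label structure: training videos carry only video-level event tags, while the parser must output segment-level predictions separately for the audio and visual modalities. The segment count (10 one-second segments per clip) and category count (25 events) follow the usual LLP convention and are assumptions here, not details stated in this report; treating audio-visual events as the per-segment intersection of the two modalities likewise follows the standard AVVP evaluation convention.

```python
# Minimal sketch of AVVP weak labels vs. dense predictions (assumed LLP layout).
import torch

NUM_SEGMENTS = 10   # one-second segments per 10-second clip (assumed)
NUM_CLASSES = 25    # LLP event categories (assumed)

# Weak (video-level) label available during training: which events occur
# somewhere in the clip, with no timing or modality information.
video_label = torch.zeros(NUM_CLASSES)
video_label[[3, 17]] = 1.0  # e.g. the clip is tagged with two events

# What the parser must predict at test time: segment-level event presence,
# separately for the audio and the visual modality.
audio_pred = torch.rand(NUM_SEGMENTS, NUM_CLASSES) > 0.5
visual_pred = torch.rand(NUM_SEGMENTS, NUM_CLASSES) > 0.5

# Audio-visual events: predicted in both modalities for the same segment.
audio_visual_pred = audio_pred & visual_pred
```

The 'Experiment Setup' row reports the optimizer, schedule, and clipping hyperparameters but not the peak or minimum learning rates, nor the model definition or loss. The following is a minimal PyTorch sketch of that training configuration under those assumptions; `peak_lr`, `min_lr`, `train_loader`, and `compute_loss` are hypothetical placeholders, not values or interfaces from the paper.

```python
# Sketch of the reported setup: AdamW (β1=0.5, β2=0.999, weight decay 0.001),
# 10-epoch linear warm-up then cosine annealing, batch size 64, 60 epochs,
# gradient-norm clipping at 1.0.
import math
import torch

def build_optimizer_and_scheduler(model, peak_lr=1e-4, min_lr=1e-7,
                                  warmup_epochs=10, total_epochs=60):
    optimizer = torch.optim.AdamW(model.parameters(), lr=peak_lr,
                                  betas=(0.5, 0.999), weight_decay=1e-3)

    def lr_lambda(epoch):
        # Linear warm-up to the peak learning rate, then cosine decay
        # toward the minimum learning rate.
        if epoch < warmup_epochs:
            return (epoch + 1) / warmup_epochs
        progress = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs)
        cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
        return (min_lr + (peak_lr - min_lr) * cosine) / peak_lr

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
    return optimizer, scheduler

def train(model, train_loader, compute_loss, total_epochs=60):
    # train_loader is assumed to yield batches of size 64.
    optimizer, scheduler = build_optimizer_and_scheduler(
        model, total_epochs=total_epochs)
    for epoch in range(total_epochs):
        for batch in train_loader:
            loss = compute_loss(model, batch)
            optimizer.zero_grad()
            loss.backward()
            # Clip the gradient norm at 1.0, as reported.
            torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
            optimizer.step()
        scheduler.step()  # learning-rate schedule is stepped once per epoch
```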