Modality-Independent Teachers Meet Weakly-Supervised Audio-Visual Event Parser

Authors: Yung-Hsuan Lai, Yen-Chun Chen, Yu-Chiang Frank Wang

NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirical studies show that the harvested labels significantly improve an attentional baseline by 8.0 in average F-score (Type@AV).
Researcher Affiliation | Collaboration | Yung-Hsuan Lai (National Taiwan University), Yen-Chun Chen (Microsoft), Yu-Chiang Frank Wang (National Taiwan University, NVIDIA)
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | Yes | Code is available at: https://github.com/Franklin905/VALOR.
Open Datasets | Yes | Tian et al. [73] proposed the Audio-Visual Video Parsing (AVVP) task, which aims to recognize events independently for the audio and visual modalities and to temporally localize these events. ... Tian et al. [73] created this dataset (Look, Listen, and Parse; LLP) in a weakly-supervised setting.
Dataset Splits | Yes | The dataset is divided into training, validation, and testing splits, containing 10,000, 649, and 1,200 clips, respectively.
Hardware Specification | No | The paper only states in general terms that 'We thank National Center for High-performance Computing (NCHC) for providing computational and storage resources,' without specifying particular hardware components such as CPU or GPU models.
Software Dependencies | No | The paper mentions software components and models such as 'ResNet-152', 'VGGish', 'CLIP', and 'CLAP', but does not provide specific version numbers for these or for other ancillary software dependencies required for replication.
Experiment Setup | Yes | The models are trained using the AdamW optimizer, configured with β1 = 0.5, β2 = 0.999, and weight decay set to 0.001. We employ a learning rate scheduling approach that begins with a linear warm-up phase over 10 epochs, rises to the peak learning rate, and then decays according to a cosine annealing schedule to the minimum learning rate. We set the batch size to 64 and train for 60 epochs in total. We clip the gradient norm at 1.0 during training.
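
To make the weakly-supervised setting quoted in the 'Open Datasets' row concrete, the sketch below illustrates the AVVP label structure: training videos carry only video-level event tags, while the parser must output segment-level predictions separately for the audio and visual modalities. The segment count (10 one-second segments per clip) and category count (25 events) follow the usual LLP convention and are assumptions here, not details stated in this report; treating audio-visual events as the per-segment intersection of the two modalities likewise follows the standard AVVP evaluation convention.

```python
# Minimal sketch of AVVP weak labels vs. dense predictions (assumed LLP layout).
import torch

NUM_SEGMENTS = 10   # one-second segments per 10-second clip (assumed)
NUM_CLASSES = 25    # LLP event categories (assumed)

# Weak (video-level) label available during training: which events occur
# somewhere in the clip, with no timing or modality information.
video_label = torch.zeros(NUM_CLASSES)
video_label[[3, 17]] = 1.0  # e.g. the clip is tagged with two events

# What the parser must predict at test time: segment-level event presence,
# separately for the audio and the visual modality.
audio_pred = torch.rand(NUM_SEGMENTS, NUM_CLASSES) > 0.5
visual_pred = torch.rand(NUM_SEGMENTS, NUM_CLASSES) > 0.5

# Audio-visual events: predicted in both modalities for the same segment.
audio_visual_pred = audio_pred & visual_pred
```

The 'Experiment Setup' row reports the optimizer, schedule, and clipping hyperparameters but not the peak or minimum learning rates, nor the model definition or loss. The following is a minimal PyTorch sketch of that training configuration under those assumptions; `peak_lr`, `min_lr`, `train_loader`, and `compute_loss` are hypothetical placeholders, not values or interfaces from the paper.

```python
# Sketch of the reported setup: AdamW (β1=0.5, β2=0.999, weight decay 0.001),
# 10-epoch linear warm-up then cosine annealing, batch size 64, 60 epochs,
# gradient-norm clipping at 1.0.
import math
import torch

def build_optimizer_and_scheduler(model, peak_lr=1e-4, min_lr=1e-7,
                                  warmup_epochs=10, total_epochs=60):
    optimizer = torch.optim.AdamW(model.parameters(), lr=peak_lr,
                                  betas=(0.5, 0.999), weight_decay=1e-3)

    def lr_lambda(epoch):
        # Linear warm-up to the peak learning rate, then cosine decay
        # toward the minimum learning rate.
        if epoch < warmup_epochs:
            return (epoch + 1) / warmup_epochs
        progress = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs)
        cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
        return (min_lr + (peak_lr - min_lr) * cosine) / peak_lr

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
    return optimizer, scheduler

def train(model, train_loader, compute_loss, total_epochs=60):
    # train_loader is assumed to yield batches of size 64.
    optimizer, scheduler = build_optimizer_and_scheduler(
        model, total_epochs=total_epochs)
    for epoch in range(total_epochs):
        for batch in train_loader:
            loss = compute_loss(model, batch)
            optimizer.zero_grad()
            loss.backward()
            # Clip the gradient norm at 1.0, as reported.
            torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
            optimizer.step()
        scheduler.step()  # learning-rate schedule is stepped once per epoch
```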