Modality-Independent Teachers Meet Weakly-Supervised Audio-Visual Event Parser
Authors: Yung-Hsuan Lai, Yen-Chun Chen, Yu-Chiang Frank Wang
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirical studies show that the harvested labels significantly improve an attentional baseline by 8.0 in average F-score (Type@AV). |
| Researcher Affiliation | Collaboration | Yung-Hsuan Lai (National Taiwan University), Yen-Chun Chen (Microsoft), Yu-Chiang Frank Wang (National Taiwan University, NVIDIA) |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code is available at: https://github.com/Franklin905/VALOR. |
| Open Datasets | Yes | Tian et al. [73] proposed the Audio-Visual Video Parsing (AVVP) task, which aims to recognize events in videos for the audio and visual modalities independently and to temporally localize these events. ... Tian et al. [73] created this dataset (Look, Listen, and Parse; LLP) in a weakly-supervised setting. |
| Dataset Splits | Yes | The dataset is divided into training, validation, and testing splits, containing 10,000, 649, and 1,200 clips, respectively. |
| Hardware Specification | No | The paper only generally states 'We thank National Center for High-performance Computing (NCHC) for providing computational and storage resources,' without specifying any particular hardware components like CPU or GPU models. |
| Software Dependencies | No | The paper mentions software components and models like 'ResNet-152', 'VGGish', 'CLIP', and 'CLAP' but does not provide specific version numbers for these or other ancillary software dependencies required for replication. |
| Experiment Setup | Yes | The models are trained using the AdamW optimizer, configured with β1 = 0.5, β2 = 0.999, and weight decay set to 0.001. We employ a learning rate scheduling approach that initiates with a linear warm-up phase over 10 epochs, rises to the peak learning rate, and then decays according to a cosine annealing schedule to the minimum learning rate. We set the batch size to 64 and train for 60 epochs in total. We clip the gradient norm at 1.0 during training. (See the configuration sketch below the table.) |
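
The training recipe quoted in the Experiment Setup row maps directly onto standard PyTorch components. Below is a minimal sketch of that configuration, assuming an epoch-granular schedule; the peak and minimum learning rates, the model, and the data are placeholders not specified in the excerpt, while the optimizer betas, weight decay, 10-epoch warm-up, batch size of 64, 60 epochs, and gradient-clipping norm of 1.0 follow the paper.

```python
import math
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

EPOCHS, WARMUP_EPOCHS = 60, 10
PEAK_LR, MIN_LR = 1e-4, 1e-7  # placeholders; not stated in the excerpt

# Stand-ins for the actual parser and the LLP features.
model = torch.nn.Linear(512, 25)
dataset = torch.utils.data.TensorDataset(torch.randn(640, 512))
loader = torch.utils.data.DataLoader(dataset, batch_size=64)

# AdamW with the betas and weight decay reported in the paper.
optimizer = AdamW(model.parameters(), lr=PEAK_LR,
                  betas=(0.5, 0.999), weight_decay=1e-3)

def lr_lambda(epoch: int) -> float:
    """Linear warm-up to the peak LR, then cosine decay toward MIN_LR."""
    if epoch < WARMUP_EPOCHS:
        return (epoch + 1) / WARMUP_EPOCHS
    progress = (epoch - WARMUP_EPOCHS) / (EPOCHS - WARMUP_EPOCHS)
    floor = MIN_LR / PEAK_LR
    return floor + (1 - floor) * 0.5 * (1 + math.cos(math.pi * progress))

scheduler = LambdaLR(optimizer, lr_lambda)

for epoch in range(EPOCHS):
    for (batch,) in loader:
        loss = model(batch).sum()  # stand-in for the actual training loss
        optimizer.zero_grad()
        loss.backward()
        # Gradient-norm clipping at 1.0, per the quoted setup.
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        optimizer.step()
    scheduler.step()
```

The warm-up here advances per epoch for brevity; a step-granular variant would interpolate the same linear-then-cosine schedule within each epoch.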