Cross-Modal Label Contrastive Learning for Unsupervised Audio-Visual Event Localization

Authors: Peijun Bao, Wenhan Yang, Boon Poh Ng, Meng Hwa Er, Alex C. Kot

AAAI 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments demonstrate that our unsupervised approach performs reasonably well compared to the state-of-the-art supervised methods.
Researcher Affiliation | Academia | Peijun Bao (1), Wenhan Yang* (1,2), Boon Poh Ng (1), Meng Hwa Er (1), Alex C. Kot (1); (1) Nanyang Technological University, (2) Peng Cheng Laboratory
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide any concrete access to source code for the methodology described, nor does it explicitly state that the code is released.
Open Datasets | Yes | Following existing works on fully/weakly-supervised settings (Zhou et al. 2021; Tian et al. 2018; Xu et al. 2020), we conduct our experiment on the AVE dataset (Tian et al. 2018).
Dataset Splits | No | The paper states 'We follow the identical setting to previous works for trainset/testset data splitting' but does not provide specific details on validation splits or percentages for any splits.
Hardware Specification | No | The paper does not provide specific hardware details such as GPU or CPU models used for running the experiments.
Software Dependencies | No | The paper mentions optimizers (SGD, Adam) and pretrained networks (VGG-like, VGG-19, ResNet-151) but does not provide specific software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow versions).
Experiment Setup | Yes | The audio-visual collaborative learning and self-label contrastive model is trained with SGD optimizer with a learning rate of 0.05 and the batch size is set to 32. The learning rate is gradually decayed with a cosine decay schedule (Loshchilov and Hutter 2017). The hidden dimension d in the cross-modal collaboration module is set to 512. The number of parallel attention heads is set to 4. The memory bank is maintained with a momentum of 0.9 for each modality respectively. The numbers of training epochs for audio-visual collaborative learning and self-label contrastive model are both set to 200. To make the convergence more stable, we implement the feature decorrelation loss with its soft version as in (Tao, Takagi, and Nakata 2021). The hyperparameter C is set to 28. The contrastive set size K is set to 30. The localization model is trained with Adam (Kingma and Ba 2014) optimizer with a learning rate of 5 × 10⁻⁴ and weight decay of 5 × 10⁻⁴. And the total EM step is set to 3.
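For reference, the hyperparameters quoted in the Experiment Setup row can be collected into a short training-setup sketch. The snippet below is a minimal PyTorch-style illustration, not the authors' code: the model objects (simple Linear placeholders here) and the variable names are assumptions for illustration, and only the optimizer choices, learning rates, scheduler, and hyperparameter values come from the paper's description.

```python
# Minimal sketch of the reported training setup (assumed PyTorch).
# The Linear modules below are hypothetical placeholders standing in for the
# collaborative/contrastive model and the localization model; only the
# hyperparameter values are taken from the paper's experiment setup.
import torch
from torch.optim import SGD, Adam
from torch.optim.lr_scheduler import CosineAnnealingLR

# Hyperparameters as reported in the experiment setup.
CONFIG = {
    "collab_lr": 0.05,        # SGD learning rate for collaborative/contrastive training
    "batch_size": 32,
    "hidden_dim": 512,        # d in the cross-modal collaboration module
    "num_heads": 4,           # parallel attention heads
    "memory_momentum": 0.9,   # memory-bank momentum per modality
    "epochs": 200,            # both collaborative learning and self-label contrastive model
    "C": 28,                  # hyperparameter C
    "K": 30,                  # contrastive set size
    "loc_lr": 5e-4,           # Adam learning rate for the localization model
    "loc_weight_decay": 5e-4,
    "em_steps": 3,            # total EM steps
}

# Placeholder for the audio-visual collaborative / self-label contrastive model.
collab_model = torch.nn.Linear(CONFIG["hidden_dim"], CONFIG["hidden_dim"])

# SGD with cosine learning-rate decay for the collaborative/contrastive stage.
collab_optimizer = SGD(collab_model.parameters(), lr=CONFIG["collab_lr"])
collab_scheduler = CosineAnnealingLR(collab_optimizer, T_max=CONFIG["epochs"])

# Placeholder for the localization model, trained with Adam and weight decay.
loc_model = torch.nn.Linear(CONFIG["hidden_dim"], 2)
loc_optimizer = Adam(loc_model.parameters(), lr=CONFIG["loc_lr"],
                     weight_decay=CONFIG["loc_weight_decay"])
```

This sketch only fixes the quantities the paper states explicitly (optimizer types, learning rates, scheduler, batch size, epochs, momentum, C, K, EM steps); architecture, data loading, and the loss implementations are left out because the paper and this report give no further code-level detail for them.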