Dual Mean-Teacher: An Unbiased Semi-Supervised Framework for Audio-Visual Source Localization
Authors: Yuxin Guo, Shijie Ma, Hu Su, Zhiqing Wang, Yuhao Zhao, Wei Zou, Siyang Sun, Yun Zheng
NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this paper, we propose a novel semi-supervised learning framework for AVSL, namely Dual Mean-Teacher (DMT), comprising two teacher-student structures to circumvent the confirmation bias issue. ... enable DMT to outperform current state-of-the-art methods by a large margin, with CIoU of 90.4% and 48.8% on Flickr-SoundNet and VGG-Sound Source, obtaining 8.9%, 9.6% and 4.6%, 6.4% improvements over self- and semi-supervised methods respectively, given only < 3% positional annotations. |
| Researcher Affiliation | Collaboration | Yuxin Guo¹·²·³, Shijie Ma¹·², Hu Su¹·², Zhiqing Wang¹·², Yuhao Zhao¹·², Wei Zou¹·²; ¹School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China; ²State Key Laboratory of Multimodal Artificial Intelligence Systems (MAIS), Institute of Automation, Chinese Academy of Sciences, Beijing, China; ³DAMO Academy, Alibaba Group |
| Pseudocode | Yes | Appendix A (Algorithm): "To make it more clear, Dual Mean-Teacher is specifically depicted in Algorithm 1." Algorithm 1: Dual Mean-Teacher algorithm. (A hedged sketch of the EMA teacher update at the core of a Mean-Teacher scheme appears after this table.) |
| Open Source Code | Yes | Our code is available at https://github.com/gyx-gloria/DMT. |
| Open Datasets | Yes | We conduct experiments on two large-scale audio-visual datasets: Flickr-SoundNet [14, 15] and VGG-Sound Source [46], where there are 5,000 and 5,158 bounding-box annotated samples, respectively. |
| Dataset Splits | Yes | For labeled data, we randomly select 4,250 for training, 500 for validating, and keep the same test sets with 250 samples as previous works [9, 10, 28, 30]. |
| Hardware Specification | No | The paper states 'The training is conducted on two GPUs,' but does not provide specific details on the GPU model, CPU, or other hardware components. |
| Software Dependencies | No | The paper mentions software components like ResNet18, VGGish, SoundNet, RandAug, and the Adam optimizer but does not provide specific version numbers for these or other software dependencies. |
| Experiment Setup | Yes | For audio clips, we pass 96×64 log-mel spectrograms to VGGish, and the output is a 512-D vector, while the raw waveform of the original 3s audio clip is sent to SoundNet. For frames, we used an input image of size 256×256×3, with 224×224×512 as output. We choose RandAug [51] as strong augmentation, while random cropping, resizing, and random horizontal flip as weak augmentation. We set δ as 0.6 and τ as 0.7. ... we used a batch size of 128. The Warm-Up stage is trained for 6 epochs to achieve convergence, while the Unbiased-Learning stage is trained for 20 epochs. The learning rate for the image is set to 1e-4, and the weight for the contrastive loss λu is set to 1. An Exponential Moving Average (EMA) decay of 0.999 is applied. The Adam optimizer is used for training... (Sketches of the EMA update and a plausible δ/τ filtering rule follow this table.) |
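
The Pseudocode and Experiment Setup rows together pin down the core Mean-Teacher mechanic: each teacher tracks its student via an exponential moving average with decay 0.999. Below is a minimal PyTorch sketch of such an update, assuming the standard Mean-Teacher formulation; `ema_update` and the toy usage names are ours, not the authors' code:

```python
import copy

import torch

@torch.no_grad()
def ema_update(teacher: torch.nn.Module, student: torch.nn.Module,
               decay: float = 0.999) -> None:
    # Blend student weights into the teacher: t <- decay * t + (1 - decay) * s.
    for t_param, s_param in zip(teacher.parameters(), student.parameters()):
        t_param.mul_(decay).add_(s_param, alpha=1.0 - decay)
    # Buffers (e.g. BatchNorm running stats) are typically copied directly.
    for t_buf, s_buf in zip(teacher.buffers(), student.buffers()):
        t_buf.copy_(s_buf)

# Minimal usage: the teacher starts as a frozen copy of the student and is
# updated once per optimizer step (a toy Linear stands in for the backbone).
student = torch.nn.Linear(512, 1)
teacher = copy.deepcopy(student)
for p in teacher.parameters():
    p.requires_grad_(False)
ema_update(teacher, student, decay=0.999)
```

In DMT this update would run once per pair, since the framework trains two teacher-student structures in parallel.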
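
The setup row also fixes two thresholds, δ = 0.6 and τ = 0.7, without spelling out their roles in the quoted excerpt. One plausible reading, given DMT's two-teacher design for rejecting noisy pseudo-labels, is that δ binarizes each teacher's localization map and τ gates samples on inter-teacher agreement (IoU). The sketch below encodes only that assumption; `filter_pseudo_labels` and its interface are hypothetical:

```python
import torch

def filter_pseudo_labels(map_a: torch.Tensor, map_b: torch.Tensor,
                         delta: float = 0.6, tau: float = 0.7):
    # Hypothetical two-teacher agreement filter; delta/tau roles are assumed.
    # Binarize each teacher's (H, W) localization map at delta.
    bin_a = map_a > delta
    bin_b = map_b > delta
    # Agreement measured as IoU of the two binary maps.
    inter = (bin_a & bin_b).float().sum()
    union = (bin_a | bin_b).float().sum().clamp(min=1.0)
    keep = bool(inter / union >= tau)
    # Candidate pseudo-label: the region both teachers agree on.
    return (bin_a & bin_b).float(), keep

# Example with random maps; real inputs would be the two teachers' outputs.
pseudo_label, keep = filter_pseudo_labels(torch.rand(224, 224),
                                          torch.rand(224, 224))
```

Under this reading, samples with `keep == False` would simply be excluded from the pseudo-label loss on unlabeled data, which is one way a two-teacher scheme can avoid the confirmation bias the paper targets.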