Discovering Sounding Objects by Audio Queries for Audio Visual Segmentation

Authors: Shaofei Huang, Han Li, Yuqing Wang, Hongji Zhu, Jiao Dai, Jizhong Han, Wenge Rong, Si Liu

IJCAI 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments are conducted on two AVS benchmarks to show that our method achieves state-of-the-art performances, especially 7.1% M_J and 7.6% M_F gains on the MS3 setting. (A hedged metric sketch follows the table below.)
Researcher Affiliation | Collaboration | Shaofei Huang (1,2), Han Li (3), Yuqing Wang (4), Hongji Zhu (4), Jiao Dai (1,2), Jizhong Han (1,2), Wenge Rong (3) and Si Liu (5,6). Affiliations: 1) Institute of Information Engineering, Chinese Academy of Sciences; 2) School of Cyber Security, University of Chinese Academy of Sciences; 3) School of Computer Science and Engineering, Beihang University; 4) Alibaba Group; 5) Institute of Artificial Intelligence, Beihang University; 6) Hangzhou Innovation Institute, Beihang University.
Pseudocode | No | The paper does not contain any sections explicitly labeled 'Pseudocode' or 'Algorithm,' nor does it present any structured, code-like blocks.
Open Source Code | No | The paper mentions 'More implementation details are provided in the supplementary materials' but does not explicitly state that the source code for the methodology is available, nor does it provide a direct link to a code repository.
Open Datasets | Yes | We conduct experiments on two benchmark settings of the AVS [Zhou et al., 2022]: 1) The semi-supervised Single Sound Source Segmentation (S4) where the sounding object remains the same in the given video clip, i.e., only the mask annotation of the first frame is provided for training; 2) The fully supervised Multiple Sound Source Segmentation (MS3) where the sounding object dynamically changes over time, and mask annotations of all the T frames are provided. ... We use both ResNet-50 [He et al., 2016] pretrained on MSCOCO [Lin et al., 2014] dataset and the PVT-v2 b5 [Wang et al., 2022] pretrained on ImageNet [Russakovsky et al., 2015] dataset as the visual encoders. (An illustrative supervision-mask sketch follows the table below.)
Dataset Splits | No | The paper mentions training and testing on the S4 and MS3 benchmarks and evaluating performance with the M_F and M_J metrics, but it does not specify explicit validation dataset splits (e.g., percentages, sample counts, or predefined splits for validation).
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU models, CPU types, or cloud instance specifications) used for running the experiments.
Software Dependencies | No | The paper mentions using ResNet-50 and PVT-v2 as visual encoders and the AdamW optimizer, but it does not specify versions for any ancillary software dependencies (e.g., programming languages, deep learning frameworks, or libraries).
Experiment Setup | Yes | The total number of video frames T is set to 5 for each video clip. The number N of transformer decoder stages is set to 3. λ_bce, λ_dice, and λ_sim are all set to 1. For the S4 setting, we use the polynomial learning rate schedule and the AdamW optimizer with an initial learning rate of 1.25e-4 and weight decay of 5e-2. Batch size is set to 8/4 and the total number of training iterations is set to 20k/40k for experiments on ResNet-50/PVT-v2 b5. The MS3 setting adopts the same hyperparameters except: (1) the initial learning rate is 5e-4, (2) the total training iterations are reduced to 2k/4k. (A hedged training-configuration sketch follows the table below.)
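
The M_J and M_F gains quoted in the Research Type row refer to the AVS benchmark's mean Jaccard index and F-score computed between predicted and ground-truth sounding-object masks. The sketch below is a minimal NumPy illustration of how such per-frame metrics are commonly computed for binary masks; the function names and the beta2 = 0.3 weighting in the F-score are assumptions for illustration, not details stated in this report.

import numpy as np

def jaccard(pred: np.ndarray, gt: np.ndarray) -> float:
    """Jaccard index (IoU) between two binary masks; M_J is its mean over frames."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return float(inter / union) if union > 0 else 1.0

def f_score(pred: np.ndarray, gt: np.ndarray, beta2: float = 0.3) -> float:
    """F-measure between two binary masks; M_F is its mean over frames.

    beta2 = 0.3 is an assumed weighting; the report does not quote the exact value.
    """
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.logical_and(pred, gt).sum()
    precision = tp / pred.sum() if pred.sum() > 0 else 0.0
    recall = tp / gt.sum() if gt.sum() > 0 else 0.0
    denom = beta2 * precision + recall
    return float((1 + beta2) * precision * recall / denom) if denom > 0 else 0.0

# Toy example on a pair of 4x4 masks.
pred = np.zeros((4, 4)); pred[:2, :2] = 1
gt = np.zeros((4, 4)); gt[:2, :3] = 1
print(jaccard(pred, gt), f_score(pred, gt))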
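The Open Datasets row distinguishes the S4 setting (only the first frame of each clip is annotated for training) from the MS3 setting (all T frames are annotated). Below is a minimal, hypothetical sketch of how that difference could be expressed as a per-frame supervision mask during training; it illustrates the two protocols and is not the authors' code.

import torch

def frame_supervision_mask(batch_size: int, num_frames: int, setting: str) -> torch.Tensor:
    """Return a (B, T) mask marking which frames carry ground-truth annotations.

    S4: only the first frame of each clip is labeled for training.
    MS3: all T frames are labeled.
    """
    mask = torch.zeros(batch_size, num_frames)
    if setting == "S4":
        mask[:, 0] = 1.0
    elif setting == "MS3":
        mask[:, :] = 1.0
    else:
        raise ValueError(f"unknown setting: {setting}")
    return mask

# T = 5 frames per clip, as stated in the Experiment Setup row.
print(frame_supervision_mask(2, 5, "S4"))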
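The Experiment Setup row lists the optimizer and schedule hyperparameters. The sketch below wires the quoted values (S4 setting, ResNet-50 backbone) into a PyTorch AdamW optimizer with a polynomial learning-rate decay. The placeholder model, the polynomial power of 0.9, and the total_loss helper are assumptions added for illustration; they are not specified in the report.

import torch

# Quoted hyperparameters (S4 setting, ResNet-50 backbone); MS3 uses lr 5e-4 and 2k iterations.
INIT_LR = 1.25e-4
WEIGHT_DECAY = 5e-2
BATCH_SIZE = 8
TOTAL_ITERS = 20_000
POLY_POWER = 0.9                               # assumed; the polynomial power is not quoted
LAMBDA_BCE = LAMBDA_DICE = LAMBDA_SIM = 1.0    # loss weights, all set to 1 per the report

model = torch.nn.Conv2d(3, 1, kernel_size=1)   # placeholder, not the actual segmentation model

optimizer = torch.optim.AdamW(model.parameters(), lr=INIT_LR, weight_decay=WEIGHT_DECAY)
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda it: (1.0 - it / TOTAL_ITERS) ** POLY_POWER
)

def total_loss(l_bce, l_dice, l_sim):
    """Weighted sum of the three loss terms named in the report (hypothetical helper)."""
    return LAMBDA_BCE * l_bce + LAMBDA_DICE * l_dice + LAMBDA_SIM * l_sim

# One training step (schematic): compute the losses on the supervised frames, then
# loss = total_loss(l_bce, l_dice, l_sim); loss.backward()
optimizer.step()        # no-op here since no gradients were computed
optimizer.zero_grad()
scheduler.step()        # advances the polynomial decay by one iteration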