Segment beyond View: Handling Partially Missing Modality for Audio-Visual Semantic Segmentation
Authors: Renjie Wu, Hu Wang, Feras Dayoub, Hsiang-Ting Chen
AAAI 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | SBV outperforms existing models in comparative evaluations and shows consistent performance across varying FoV ranges and in monaural audio settings. Adapting the Omni Auditory Perception Dataset (Dai et al. 2022; Vasudevan, Dai, and Van Gool 2020) to the proposed task, the results suggest that the method outperforms state-of-the-art audio-visual semantic segmentation methods (Zhou et al. 2022, 2023). The paper also presents ablation studies examining various degrees of partially missing modality and different model architectures. |
| Researcher Affiliation | Academia | The University of Adelaide {renjie.wu, hu.wang, feras.dayoub, tim.chen}@adelaide.edu.au |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. It describes the model architecture and training process in text and diagrams, but without pseudocode formatting. |
| Open Source Code | No | The paper does not provide concrete access to source code, such as a specific repository link, an explicit code release statement, or mention of code in supplementary materials. |
| Open Datasets | Yes | Adapting the Omni Auditory Perception Dataset (Dai et al. 2022; Vasudevan, Dai, and Van Gool 2020) to the proposed task |
| Dataset Splits | Yes | In addition to the normal training dataset (51,400) and validation dataset (6,208), it contains two test datasets: Auditory Test Pseudo dataset (6,492) and Auditory Test Manual dataset. |
| Hardware Specification | Yes | We train models by using NVIDIA A100 GPUs. |
| Software Dependencies | No | The paper mentions various software components and models (e.g., Adam, OpenCV, SegFormer, SoundNet, ResNet50, DeepLabv3+), and provides citations for them, but does not specify their version numbers (e.g., OpenCV version 4.x, PyTorch version X.Y). |
| Experiment Setup | Yes | We use Adam (Kingma and Ba 2014) and set the learning rate as 1×10⁻⁵ for the optimizer. We use the one-cycle policy (Smith and Topin 2019) as our learning rate decay strategy. All images are resized to 480×480. The spectrogram size is set as 257×601. All student models are trained for 50 epochs to ensure that the loss converges. For Eqn. 7, we set βa = 0.1 and βv = 0.4 for logits distillation; for the feature distillation part, we set all λ = 0.05 and all γ = 0.02. |
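The hyperparameters quoted in the Experiment Setup row can be collected into a single configuration sketch. This is an illustrative reconstruction from the paper's text, not released code; all key names are my own, since the paper does not publish an implementation.

```python
# Hypothetical training configuration assembled from the paper's
# Experiment Setup description (key names are illustrative, not from
# any released code).
config = {
    "optimizer": "Adam",            # Kingma and Ba 2014
    "learning_rate": 1e-5,
    "lr_schedule": "one-cycle",     # Smith and Topin 2019
    "image_size": (480, 480),       # all images resized
    "spectrogram_size": (257, 601),
    "epochs": 50,                   # student models, trained to convergence
    "beta_a": 0.1,                  # audio logits-distillation weight (Eqn. 7)
    "beta_v": 0.4,                  # visual logits-distillation weight (Eqn. 7)
    "lambda_feat": 0.05,            # feature-distillation weight (all λ)
    "gamma_feat": 0.02,             # feature-distillation weight (all γ)
}
```

A reader attempting reproduction would still need details the paper omits, such as batch size and framework versions, which is consistent with the "No" verdict on Software Dependencies above.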