AVSegFormer: Audio-Visual Segmentation with Transformer
Authors: Shengyi Gao, Zhe Chen, Guo Chen, Wenhai Wang, Tong Lu
AAAI 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments demonstrate that AVSegFormer achieves state-of-the-art results on the AVS benchmark. The code is available at https://github.com/vvvb-github/AVSegFormer. |
| Researcher Affiliation | Academia | Shengyi Gao¹, Zhe Chen¹, Guo Chen¹, Wenhai Wang², Tong Lu¹*. ¹State Key Lab for Novel Software Technology, Nanjing University; ²The Chinese University of Hong Kong. lutong@nju.edu.cn |
| Pseudocode | No | The paper describes the methods using text and architectural diagrams (e.g., Figure 3), but it does not include any explicit pseudocode or algorithm blocks. |
| Open Source Code | Yes | The code is available at https://github.com/vvvb-github/AVSegFormer. |
| Open Datasets | Yes | AVSBench-Object (Zhou et al. 2022) is an audio-visual dataset specifically designed for the audio-visual segmentation task, containing pixel-level annotations. ... AVSBench-Semantic (Zhou et al. 2023) is an extension of the AVSBench-Object... |
| Dataset Splits | Yes | S4 subset: The S4 subset contains 4,932 videos, with 3,452 videos for training, 740 for validation, and 740 for testing. |
| Hardware Specification | Yes | We train our AVSegFormer models for the three AVS sub-tasks using an NVIDIA V100 GPU. |
| Software Dependencies | No | The paper mentions using AdamW as the optimizer but does not specify software versions for programming languages, libraries, or frameworks (e.g., Python, PyTorch, CUDA versions). |
| Experiment Setup | Yes | Consistent with previous works (Zhou et al. 2022, 2023), we employ AdamW (Loshchilov and Hutter 2017) as the optimizer, with a batch size of 2 and an initial learning rate of 2 × 10⁻⁵. Since the MS3 subset is quite small, we train it for 60 epochs, while the S4 and AVSS subsets are trained for 30 epochs. The encoder and decoder in our AVSegFormer comprise 6 layers with an embedding dimension of 256. We set the coefficient of the proposed mixing loss L_mix to 0.1 for the best performance. |
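
For concreteness, here is a minimal sketch (PyTorch assumed; this is not the authors' released code) of how the hyperparameters quoted in the Experiment Setup row would be wired together. The `nn.Transformer` stand-in and both loss placeholders are hypothetical; only AdamW, the batch size of 2, the 2 × 10⁻⁵ learning rate, the 6-layer encoder/decoder with embedding dimension 256, the per-subset epoch counts, and the L_mix coefficient of 0.1 come from the paper.

```python
import torch
from torch import nn
from torch.optim import AdamW

EMBED_DIM = 256                              # embedding dimension (paper)
NUM_LAYERS = 6                               # encoder/decoder depth (paper)
BATCH_SIZE = 2                               # batch size (paper)
LR = 2e-5                                    # initial learning rate (paper)
LAMBDA_MIX = 0.1                             # mixing-loss coefficient (paper)
EPOCHS = {"S4": 30, "MS3": 60, "AVSS": 30}   # MS3 is small, hence 60 epochs

# Stand-in for AVSegFormer; a real run would build the model from the
# released repository (https://github.com/vvvb-github/AVSegFormer).
model = nn.Transformer(
    d_model=EMBED_DIM,
    num_encoder_layers=NUM_LAYERS,
    num_decoder_layers=NUM_LAYERS,
)
optimizer = AdamW(model.parameters(), lr=LR)

# One illustrative optimization step with dummy tensors standing in for
# visual features and audio-conditioned queries.
src = torch.randn(196, BATCH_SIZE, EMBED_DIM)   # dummy visual tokens
tgt = torch.randn(300, BATCH_SIZE, EMBED_DIM)   # dummy queries
out = model(src, tgt)

seg_loss = out.pow(2).mean()                 # placeholder segmentation loss
mix_loss = out.abs().mean()                  # placeholder for L_mix
loss = seg_loss + LAMBDA_MIX * mix_loss      # weighted sum, as described
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

Since the paper does not report a learning-rate schedule or weight decay, the sketch leaves AdamW at its defaults; a faithful reproduction should follow the released repository for those details.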