AVSegFormer: Audio-Visual Segmentation with Transformer
Authors: Shengyi Gao, Zhe Chen, Guo Chen, Wenhai Wang, Tong Lu
AAAI 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments demonstrate that AVSegFormer achieves state-of-the-art results on the AVS benchmark. The code is available at https://github.com/vvvb-github/AVSegFormer. |
| Researcher Affiliation | Academia | Shengyi Gao¹, Zhe Chen¹, Guo Chen¹, Wenhai Wang², Tong Lu¹*. ¹State Key Lab for Novel Software Technology, Nanjing University; ²The Chinese University of Hong Kong. lutong@nju.edu.cn |
| Pseudocode | No | The paper describes the methods using text and architectural diagrams (e.g., Figure 3), but it does not include any explicit pseudocode or algorithm blocks. |
| Open Source Code | Yes | The code is available at https://github.com/vvvb-github/AVSegFormer. |
| Open Datasets | Yes | AVSBench-Object (Zhou et al. 2022) is an audio-visual dataset specifically designed for the audio-visual segmentation task, containing pixel-level annotations. ... AVSBench-Semantic (Zhou et al. 2023) is an extension of the AVSBench-Object... |
| Dataset Splits | Yes | S4 subset: The S4 subset contains 4,932 videos, with 3,452 videos for training, 740 for validation, and 740 for testing. |
| Hardware Specification | Yes | We train our AVSegFormer models for the three AVS sub-tasks using an NVIDIA V100 GPU. |
| Software Dependencies | No | The paper mentions using AdamW as the optimizer but does not specify software versions for programming languages, libraries, or frameworks (e.g., Python, PyTorch, CUDA versions). |
| Experiment Setup | Yes | Consistent with previous works (Zhou et al. 2022, 2023), we employ AdamW (Loshchilov and Hutter 2017) as the optimizer, with a batch size of 2 and an initial learning rate of 2 × 10⁻⁵. Since the MS3 subset is quite small, we train it for 60 epochs, while the S4 and AVSS subsets are trained for 30 epochs. The encoder and decoder in our AVSegFormer comprise 6 layers with an embedding dimension of 256. We set the coefficient of the proposed mixing loss L_mix to 0.1 for the best performance. |
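
For concreteness, here is a minimal sketch (PyTorch assumed; this is not the authors' released code) of how the hyperparameters quoted in the Experiment Setup row would be wired together. The `nn.Transformer` stand-in and both loss placeholders are hypothetical; only AdamW, the batch size of 2, the 2 × 10⁻⁵ learning rate, the 6-layer encoder/decoder with embedding dimension 256, the per-subset epoch counts, and the L_mix coefficient of 0.1 come from the paper.

```python
import torch
from torch import nn
from torch.optim import AdamW

EMBED_DIM = 256                              # embedding dimension (paper)
NUM_LAYERS = 6                               # encoder/decoder depth (paper)
BATCH_SIZE = 2                               # batch size (paper)
LR = 2e-5                                    # initial learning rate (paper)
LAMBDA_MIX = 0.1                             # mixing-loss coefficient (paper)
EPOCHS = {"S4": 30, "MS3": 60, "AVSS": 30}   # MS3 is small, hence 60 epochs

# Stand-in for AVSegFormer; a real run would build the model from the
# released repository (https://github.com/vvvb-github/AVSegFormer).
model = nn.Transformer(
    d_model=EMBED_DIM,
    num_encoder_layers=NUM_LAYERS,
    num_decoder_layers=NUM_LAYERS,
)
optimizer = AdamW(model.parameters(), lr=LR)

# One illustrative optimization step with dummy tensors standing in for
# visual features and audio-conditioned queries.
src = torch.randn(196, BATCH_SIZE, EMBED_DIM)   # dummy visual tokens
tgt = torch.randn(300, BATCH_SIZE, EMBED_DIM)   # dummy queries
out = model(src, tgt)

seg_loss = out.pow(2).mean()                 # placeholder segmentation loss
mix_loss = out.abs().mean()                  # placeholder for L_mix
loss = seg_loss + LAMBDA_MIX * mix_loss      # weighted sum, as described
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

Since the paper does not report a learning-rate schedule or weight decay, the sketch leaves AdamW at its defaults; a faithful reproduction should follow the released repository for those details.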