SOC: Semantic-Assisted Object Cluster for Referring Video Object Segmentation
Authors: Zhuoyan Luo, Yicheng Xiao, Yong Liu, Shuyan Li, Yitong Wang, Yansong Tang, Xiu Li, Yujiu Yang
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct extensive experiments on popular RVOS benchmarks, and our method outperforms state-of-the-art competitors on all benchmarks by a remarkable margin. |
| Researcher Affiliation | Collaboration | ¹Tsinghua Shenzhen International Graduate School, Tsinghua University; ²ByteDance Inc.; ³Engineering Department, University of Cambridge |
| Pseudocode | No | The paper describes its approach using text and diagrams but does not include pseudocode or a clearly labeled algorithm block. |
| Open Source Code | Yes | Code is available at https://github.com/RobertLuo1/NeurIPS2023_SOC. |
| Open Datasets | Yes | We evaluate our model on four prevalent RVOS benchmarks: Ref-YouTube-VOS [36], Ref-DAVIS17 [16], A2D-Sentences, and JHMDB-Sentences [8]. |
| Dataset Splits | Yes | We evaluate our model on four prevalent RVOS benchmarks: Ref-YouTube-VOS [36], Ref-DAVIS17 [16], A2D-Sentences, and JHMDB-Sentences [8]. Following [1, 42], we measure the effectiveness of our model by the criteria of Precision@K, Overall IoU, Mean IoU, and mAP over 0.50:0.05:0.95 for A2D-Sentences and JHMDB-Sentences. Meanwhile, we adopt standard evaluation metrics: region similarity (J), contour accuracy (F), and their average value (J&F) on Ref-YouTube-VOS and Ref-DAVIS17. (See the metric sketch below the table.) |
| Hardware Specification | Yes | The models are trained with eight 32GB V100 GPUs by default. Specifically, our SOC runs at 32.3 FPS on a single 3090 GPU. |
| Software Dependencies | No | The paper mentions using 'Video Swin Transformer [27] and RoBERTa [23] as our encoder' but does not specify software dependencies with version numbers (e.g., Python or PyTorch versions). |
| Experiment Setup | Yes | The numbers of frame-level queries O_f and video-level queries O_v are set to 20 by default. We feed the model windows of w = 8 frames during training. The coefficients for the losses are set as λ_cls = 2, λ_L1 = 2, λ_giou = 2, λ_dice = 2, λ_focal = 5, λ_con = 1. (See the loss-weight sketch below the table.) |
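
For concreteness, the region similarity J cited in the Dataset Splits row is the intersection-over-union between predicted and ground-truth binary masks, and mAP is averaged over IoU thresholds 0.50:0.05:0.95. Below is a minimal NumPy sketch of these quantities; it is not the authors' evaluation code (which follows the standard DAVIS and A2D toolkits), and the function names are illustrative.

```python
import numpy as np

def region_similarity(pred: np.ndarray, gt: np.ndarray) -> float:
    """Region similarity J: intersection-over-union of two binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:  # both masks empty: conventionally a perfect match
        return 1.0
    inter = np.logical_and(pred, gt).sum()
    return inter / union

# mAP is averaged over these IoU thresholds (0.50:0.05:0.95).
MAP_THRESHOLDS = np.arange(0.50, 0.95 + 1e-9, 0.05)

def precision_at_k(ious: np.ndarray, k: float) -> float:
    """Precision@K: fraction of samples whose mask IoU exceeds threshold K."""
    return float((ious > k).mean())
```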
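
The Experiment Setup row lists six loss coefficients. The sketch below shows how such a weighted objective is typically assembled; the dictionary keys and the `total_loss` helper are hypothetical names for illustration, not taken from the released code.

```python
import torch

# Loss coefficients reported in the paper.
LOSS_WEIGHTS = {
    "cls": 2.0,    # lambda_cls
    "l1": 2.0,     # lambda_L1
    "giou": 2.0,   # lambda_giou
    "dice": 2.0,   # lambda_dice
    "focal": 5.0,  # lambda_focal
    "con": 1.0,    # lambda_con
}

def total_loss(losses: dict) -> torch.Tensor:
    """Weighted sum of individual loss terms (keys must match LOSS_WEIGHTS)."""
    return sum(LOSS_WEIGHTS[name] * value for name, value in losses.items())

# Example with dummy scalar losses:
dummy = {name: torch.rand(()) for name in LOSS_WEIGHTS}
print(total_loss(dummy))
```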