Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Aligning Instance Brownian Bridge with Texts for Open-Vocabulary Video Instance Segmentation

Authors: Zesen Cheng, Kehan Li, Hao Li, Peng Jin, Xiawu Zheng, Chang Liu, Jie Chen

AAAI 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our method achieves 5.7 mAP and 20.9 mAP on BURST and LV-VIS, which is +2.2 mAP and +6.7 mAP better than previous state-of-the-art OV2Seg (Wang et al. 2023).
Researcher Affiliation | Academia | (1) School of Electronic and Computer Engineering, Peking University, Shenzhen, China; (2) Pengcheng Laboratory, Shenzhen, China; (3) AI for Science (AI4S)-Preferred Program, Peking University Shenzhen Graduate School, China; (4) Tsinghua University, Beijing, China; (5) Xiamen University, Xiamen, China
Pseudocode | No | The paper describes the methodology in text and provides diagrams (Figure 2: the overall pipeline of BriVIS; Figure 3: Resampler) but does not include any explicit pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide any explicit statement about releasing source code, nor does it include links to a code repository or mention code in supplementary materials.
Open Datasets | Yes | Following previous work (Guo et al. 2023), we use Youtube-VIS 2019 (Yang, Fan, and Xu 2019) and COCO (Lin et al. 2014) as the training set. During the evaluation phase, we set large-vocabulary VIS datasets (BURST (Athar et al. 2023), LV-VIS (Wang et al. 2023)) as our benchmark.
Dataset Splits | Yes | Following previous work (Guo et al. 2023), we use Youtube-VIS 2019 and COCO (Lin et al. 2014) as the training set. During the evaluation phase, we set large-vocabulary VIS datasets (BURST (Athar et al. 2023), LV-VIS (Wang et al. 2023)) as our benchmark. In this protocol, the mAP on all categories of BURST (482 categories) and LV-VIS (1196 categories) is adopted as the evaluation result. [...] In this protocol, we first split the categories of BURST and LV-VIS into a base part and a novel part according to the CLIP similarity between class texts of the evaluation and training datasets. BURST has 95 base categories and 387 novel categories. LV-VIS has 124 base categories and 1072 novel categories.
Hardware Specification | No | The paper does not provide specific details about the hardware used to run the experiments, such as GPU or CPU models.
Software Dependencies | No | The paper mentions several software components and models, such as Mask2Former, CLIP, and AdamW, but it does not specify version numbers for these or other relevant software dependencies.
Experiment Setup | Yes | We resize the shorter side to either 360 or 480 and adopt a random horizontal flip strategy. To reduce training costs, we adopt two-stage training strategies. In the first stage, we randomly sample frames from departed videos and mix them with image data for training an open-vocabulary instance segmentor. The number of sampled frames T is set to 1. We train the open-vocabulary instance segmentation for 6k iterations with a batch size of 16 in this stage. [...] In the second stage, we train the whole model on coherent clips. [...] We sample T = 5 frames from videos and train models for 6k iterations with a batch size of 16. AdamW (Loshchilov and Hutter 2019) is adopted as our optimizer, and the learning rate is set to 1e-4, which is scaled by a decay factor of 0.1 at 5k iterations in both training stages. The number of queries N is set to 100. The bound value is set to 0.5 by default.
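The base/novel split described under Dataset Splits assigns each evaluation category by CLIP text similarity against the training categories. A minimal sketch of that logic follows; the function name, the placeholder embeddings, and the similarity threshold are illustrative assumptions (the quoted text does not state the threshold or the exact matching rule), and real CLIP text embeddings would be substituted for the arrays shown here.

```python
import numpy as np

def split_base_novel(eval_embs, train_embs, eval_names, threshold=0.9):
    """Split evaluation categories into base/novel by the maximum cosine
    similarity between each evaluation class-text embedding and all
    training class-text embeddings (threshold is an assumed value)."""
    # Normalize rows so the dot product below is cosine similarity.
    eval_embs = eval_embs / np.linalg.norm(eval_embs, axis=1, keepdims=True)
    train_embs = train_embs / np.linalg.norm(train_embs, axis=1, keepdims=True)
    sims = eval_embs @ train_embs.T           # shape: (num_eval, num_train)
    max_sim = sims.max(axis=1)                # best training match per category
    base = [n for n, s in zip(eval_names, max_sim) if s >= threshold]
    novel = [n for n, s in zip(eval_names, max_sim) if s < threshold]
    return base, novel

# Toy usage with hand-made 2-D "embeddings": the first evaluation class
# matches a training class exactly, the second matches nothing.
base, novel = split_base_novel(
    eval_embs=np.array([[1.0, 0.0], [0.0, 1.0]]),
    train_embs=np.array([[1.0, 0.0]]),
    eval_names=["dog", "antelope"],
)
```

Applied to real CLIP embeddings of the BURST and LV-VIS class texts, a rule of this shape would yield the 95/387 and 124/1072 base/novel counts reported above.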
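The optimization details quoted under Experiment Setup can be condensed into a small configuration sketch. The dictionary layout and function names are illustrative, not taken from the authors' code (which is not released); only the numbers themselves come from the paper.

```python
# Two-stage schedule quoted in the setup (structure is illustrative).
STAGES = {
    "stage1": {"sampled_frames_T": 1, "iterations": 6000, "batch_size": 16},
    "stage2": {"sampled_frames_T": 5, "iterations": 6000, "batch_size": 16},
}

def learning_rate(iteration, base_lr=1e-4, decay_step=5000, decay_factor=0.1):
    """Step schedule: base LR 1e-4 (with AdamW), scaled by a factor of
    0.1 at iteration 5k within each 6k-iteration stage."""
    return base_lr * (decay_factor if iteration >= decay_step else 1.0)
```

In a PyTorch implementation this would correspond to `torch.optim.AdamW` combined with a step scheduler such as `StepLR`, applied independently per stage.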