One Token to Seg Them All: Language Instructed Reasoning Segmentation in Videos

Authors: Zechen Bai, Tong He, Haiyang Mei, Pichao Wang, Ziteng Gao, Joya Chen, Lei Liu, Zheng Zhang, Mike Zheng Shou

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive evaluations on diverse benchmarks, including our newly introduced ReasonVOS benchmark, demonstrate VideoLISA's superior performance in video object segmentation tasks involving complex reasoning, temporal understanding, and object tracking.
Researcher Affiliation | Collaboration | Zechen Bai1, Tong He2, Haiyang Mei1, Pichao Wang2, Ziteng Gao1, Joya Chen1, Lei Liu2, Zheng Zhang2, Mike Zheng Shou1 (1 Show Lab, National University of Singapore; 2 Amazon)
Pseudocode | No | The paper describes methods in prose and uses diagrams but does not include structured pseudocode or algorithm blocks.
Open Source Code | No | Code and model will be available at: https://github.com/showlab/VideoLISA.
Open Datasets | Yes | Our model is trained on a variety of segmentation datasets. The image-based datasets include 1) semantic segmentation: ADE20K [73], COCO-Stuff [9], PACO-LVIS [54], and PASCAL-Part [11]; 2) referring segmentation: refCLEF, refCOCO, refCOCO+ [28], and refCOCOg [48]; 3) reason segmentation: 239 ReasonSeg samples from LISA [31]. The video-based datasets we use include: 1) semantic VOS: YouTube-VOS [66]; 2) referring VOS: Refer-YouTube-VOS [59] and MeViS [14].
Dataset Splits | Yes | Consistent with previous studies [14, 24], we evaluate our model's performance on the validation set of the MeViS benchmark.
Hardware Specification | Yes | We train our model using 64 NVIDIA 24G A10 GPUs with a distributed training script based on DeepSpeed [56].
Software Dependencies | Yes | We implement our model with LLaVA-Phi-3-V [55], a multimodal LLM based on Phi-3 [1] with 3.8B parameters. We adopt the vision encoder and mask decoder from SAM [30].
Experiment Setup | Yes | For video data, we set Tsparse = 32 and Tdense = 4 according to our GPU memory. For image data, we duplicate the images as pseudo video data. We freeze the visual tokenizer and vision encoder, train the LLM with LoRA [25] and train the mask decoder with full finetuning. We use the AdamW [44] optimizer with the learning rate and weight decay set to 0.0003 and 0, respectively. We also adopt WarmupDecayLR as the learning rate scheduler, with the warmup iterations set to 100. The weights of the text generation loss (λtxt) and the mask loss (λseg) are both set to 1.0. The weights of the BCE loss (λbce) and the DICE loss (λdice) are set to 2.0 and 0.5, respectively. The per-device batch size is set to 2. For ablation studies, the total number of iterations is 3,000 and each experiment takes around 10 hours. For the final model used for comparison, we scale up the training to 6,000 iterations, which takes 20 hours.