One Token to Seg Them All: Language Instructed Reasoning Segmentation in Videos

Authors: Zechen Bai, Tong He, Haiyang Mei, Pichao Wang, Ziteng Gao, Joya Chen, Lei Liu, Zheng Zhang, Mike Zheng Shou

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive evaluations on diverse benchmarks, including our newly introduced ReasonVOS benchmark, demonstrate VideoLISA's superior performance in video object segmentation tasks involving complex reasoning, temporal understanding, and object tracking.
Researcher Affiliation | Collaboration | Zechen Bai1, Tong He2, Haiyang Mei1, Pichao Wang2, Ziteng Gao1, Joya Chen1, Lei Liu2, Zheng Zhang2, Mike Zheng Shou1 (1 Show Lab, National University of Singapore; 2 Amazon)
Pseudocode | No | The paper describes methods in prose and uses diagrams but does not include structured pseudocode or algorithm blocks.
Open Source Code | No | Code and model will be available at: https://github.com/showlab/VideoLISA.
Open Datasets | Yes | Our model is trained on a variety of segmentation datasets. The image-based datasets include 1) semantic segmentation: ADE20K [73], COCO-Stuff [9], PACO-LVIS [54], and PASCAL-Part [11]; 2) referring segmentation: refCLEF, refCOCO, refCOCO+ [28], and refCOCOg [48]; 3) reason segmentation: 239 ReasonSeg samples from LISA [31]. The video-based datasets we use include: 1) semantic VOS: YouTube-VOS [66]; 2) referring VOS: Refer-YouTube-VOS [59] and MeViS [14].
Dataset Splits | Yes | Consistent with previous studies [14, 24], we evaluate our model's performance on the validation set of the MeViS benchmark.
Hardware Specification | Yes | We train our model using 64 NVIDIA 24G A10 GPUs with a distributed training script based on DeepSpeed [56].
Software Dependencies | Yes | We implement our model with LLaVA-Phi-3-V [55], a multimodal LLM based on Phi-3 [1] with 3.8B parameters. We adopt the vision encoder and mask decoder from SAM [30].
Experiment Setup | Yes | For video data, we set Tsparse = 32 and Tdense = 4 according to our GPU memory. For image data, we duplicate the images as pseudo video data. We freeze the visual tokenizer and vision encoder, train the LLM with LoRA [25] and train the mask decoder with full finetuning. We use the AdamW [44] optimizer with the learning rate and weight decay set to 0.0003 and 0, respectively. We also adopt WarmupDecayLR as the learning rate scheduler, with the warmup iterations set to 100. The weights of the text generation loss (λtxt) and the mask loss (λseg) are both set to 1.0. The weights of the BCE loss (λbce) and the DICE loss (λdice) are set to 2.0 and 0.5, respectively. The per-device batch size is set to 2. For ablation studies, the total number of iterations is 3,000 and each experiment takes around 10 hours. For the final model used for comparison, we scale up the training to 6,000 iterations, which takes 20 hours.