One Token to Seg Them All: Language Instructed Reasoning Segmentation in Videos
Authors: Zechen Bai, Tong He, Haiyang Mei, Pichao Wang, Ziteng Gao, Joya Chen, Lei Liu, Zheng Zhang, Mike Zheng Shou
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive evaluations on diverse benchmarks, including our newly introduced ReasonVOS benchmark, demonstrate VideoLISA's superior performance in video object segmentation tasks involving complex reasoning, temporal understanding, and object tracking. |
| Researcher Affiliation | Collaboration | Zechen Bai¹ Tong He² Haiyang Mei¹ Pichao Wang² Ziteng Gao¹ Joya Chen¹ Lei Liu² Zheng Zhang² Mike Zheng Shou¹ — ¹Show Lab, National University of Singapore, ²Amazon |
| Pseudocode | No | The paper describes methods in prose and uses diagrams but does not include structured pseudocode or algorithm blocks. |
| Open Source Code | No | Code and model will be available at: https://github.com/showlab/VideoLISA. |
| Open Datasets | Yes | Our model is trained on a variety of segmentation datasets. The image-based datasets include 1) semantic segmentation: ADE20K [73], COCO-Stuff [9], PACO-LVIS [54], and PASCAL-Part [11]; 2) referring segmentation: refCLEF, refCOCO, refCOCO+ [28], and refCOCOg [48]; 3) reasoning segmentation: 239 ReasonSeg samples from LISA [31]. The video-based datasets we use include: 1) semantic VOS: YouTube-VOS [66]; 2) referring VOS: Refer-YouTube-VOS [59] and MeViS [14]. |
| Dataset Splits | Yes | Consistent with previous studies [14, 24], we evaluate our model's performance on the validation set of the MeViS benchmark. |
| Hardware Specification | Yes | We train our model using 64 NVIDIA 24G A10 GPUs with a distributed training script based on DeepSpeed [56]. |
| Software Dependencies | Yes | We implement our model with LLaVA-Phi-3-V [55], a multimodal LLM based on Phi-3 [1] with 3.8B parameters. We adopt the vision encoder and mask decoder from SAM [30]. |
| Experiment Setup | Yes | For video data, we set T_sparse = 32 and T_dense = 4 according to our GPU memory. For image data, we duplicate the images as pseudo video data. We freeze the visual tokenizer and vision encoder, train the LLM with LoRA [25], and train the mask decoder with full finetuning. We use the AdamW [44] optimizer with the learning rate and weight decay set to 0.0003 and 0, respectively. We also adopt WarmupDecayLR as the learning rate scheduler, with the warmup iterations set to 100. The weights of the text generation loss (λtxt) and the mask loss (λseg) are both set to 1.0. The weights of the BCE loss (λbce) and the DICE loss (λdice) are set to 2.0 and 0.5, respectively. The per-device batch size is set to 2. For ablation studies, the total number of iterations is 3,000 and each experiment takes around 10 hours. For the final model used for comparison, we scale up the training to 6,000 iterations, which takes 20 hours. |
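
The Experiment Setup row reports the full optimization recipe (AdamW, lr 0.0003, weight decay 0, warmup of 100 iterations, and the loss weights λtxt, λseg, λbce, λdice). The sketch below is a minimal, hypothetical wiring of those reported hyperparameters in PyTorch; it is not the authors' implementation. The model, LoRA adapters, data pipeline, and the exact mask-loss formulation are placeholders or assumptions, and DeepSpeed's WarmupDecayLR scheduler is approximated with a `LambdaLR`.

```python
"""Sketch of the reported training hyperparameters (assumed wiring, not the authors' code)."""
import torch

# Values quoted in the Experiment Setup row.
LR = 3e-4
WEIGHT_DECAY = 0.0
WARMUP_ITERS = 100
TOTAL_ITERS = 6_000          # final-model setting; ablations use 3,000
LAMBDA_TXT, LAMBDA_SEG = 1.0, 1.0
LAMBDA_BCE, LAMBDA_DICE = 2.0, 0.5


def dice_loss(pred_logits, target, eps=1.0):
    """Soft Dice loss over per-pixel mask logits (a common formulation; assumed here)."""
    pred = pred_logits.sigmoid().flatten(1)
    target = target.flatten(1)
    inter = (pred * target).sum(-1)
    union = pred.sum(-1) + target.sum(-1)
    return (1.0 - (2.0 * inter + eps) / (union + eps)).mean()


def mask_loss(pred_logits, target):
    """Mask loss = λbce * BCE + λdice * Dice, using the reported weights."""
    bce = torch.nn.functional.binary_cross_entropy_with_logits(pred_logits, target)
    return LAMBDA_BCE * bce + LAMBDA_DICE * dice_loss(pred_logits, target)


def total_loss(txt_loss, pred_masks, gt_masks):
    """Overall objective: λtxt * text-generation loss + λseg * mask loss (assumed composition)."""
    return LAMBDA_TXT * txt_loss + LAMBDA_SEG * mask_loss(pred_masks, gt_masks)


def warmup_decay(step):
    """Linear warmup for WARMUP_ITERS steps, then linear decay to zero
    (approximation of DeepSpeed's WarmupDecayLR)."""
    if step < WARMUP_ITERS:
        return (step + 1) / WARMUP_ITERS
    return max(TOTAL_ITERS - step, 0) / max(TOTAL_ITERS - WARMUP_ITERS, 1)


# The vision encoder/tokenizer are frozen, the LLM is trained with LoRA, and the
# mask decoder is fully finetuned; that parameter selection is handled elsewhere.
# A stand-in module is used here just to show the optimizer/scheduler setup.
model = torch.nn.Linear(8, 8)
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=LR, weight_decay=WEIGHT_DECAY)
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=warmup_decay)
```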