NExT-Chat: An LMM for Chat, Detection and Segmentation
Authors: Ao Zhang, Yuan Yao, Wei Ji, Zhiyuan Liu, Tat-Seng Chua
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Comprehensive experiments show the effectiveness of our NExT-Chat on various tasks, e.g., NExT-Chat (87.7) vs. Shikra (86.9) on POPE-Random, NExT-Chat (71.3) vs. LISA (67.9) on referring expression segmentation task, and NExT-Chat (79.6) vs. Kosmos-2 (62.3) on region caption task. |
| Researcher Affiliation | Academia | ¹National University of Singapore, ²Tsinghua University. |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide concrete access to source code for the methodology described. |
| Open Datasets | Yes | Stage-1. In stage-1, we perform pre-training using a mixture of data from various sources, including Flickr30K Entities (Plummer et al., 2015), Visual Genome (Krishna et al., 2017), RefCOCO (Yu et al., 2016), RefCOCO+ (Yu et al., 2016), RefCOCOg (Mao et al., 2016), VQAv2 (Antol et al., 2015), PointQA (Mani et al., 2020), Visual7W (Zhu et al., 2016), and VCR (Zellers et al., 2019). |
| Dataset Splits | Yes | To rigorously assess our model's proficiency in generating segmentation masks guided by natural language instructions, we use the referring expression segmentation (RES) splits of RefCOCO, RefCOCO+, and RefCOCOg. As for baselines, we choose both the LMM-based methods (LISA (Lai et al., 2023) and GLaMM (Rasheed et al., 2023)) and non-LMM-based methods including MCN (Luo et al., 2020), VLT (Ding et al., 2021), CRIS (Wang et al., 2022b), LAVT (Yang et al., 2022b), GRES (Liu et al., 2023a), X-Decoder (Zou et al., 2023a), SEEM (Zou et al., 2023b) and PolyFormer (B/L) (Liu et al., 2023d). The cIoU metric is employed to evaluate different methods. A hedged sketch of the cIoU computation is given after this table. |
| Hardware Specification | Yes | For the NExT-Chat 7B model, the stage-1 training takes 8 A100 (80G) GPUs for around 59 hours. |
| Software Dependencies | No | The paper mentions using specific models like 'CLIP ViT-L/14@336px', 'Vicuna-1.5 model', and 'SAM' but does not specify software dependencies with version numbers (e.g., PyTorch version, Python version, specific library versions). |
| Experiment Setup | Yes | The model is trained with a batch size of 64 and a learning rate of 2e-5 for 65k steps. During this pre-training stage, the entire language model, box encoder, and decoder are trained while keeping the image encoder frozen. The training loss is formulated as $\mathcal{L}_{s1} = \mathcal{L}_{\text{text}} + \mathcal{L}_{\text{det}} + \mathcal{L}_{\text{cyc}}$. ... For the NExT-Chat 7B model, the stage-1 training takes 8 A100 (80G) GPUs for around 59 hours. A hedged sketch of this loss composition is given after this table. |
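
The cIoU figure quoted in the Dataset Splits row is the cumulative IoU commonly used for referring expression segmentation: intersection and union are summed over the whole evaluation split before taking a single ratio. Below is a minimal sketch, assuming binary NumPy masks; the function name is illustrative and not taken from the paper's code.

```python
import numpy as np

def cumulative_iou(pred_masks, gt_masks):
    """cIoU: accumulate intersection and union over the whole split,
    then take one ratio (not a per-image average)."""
    total_inter, total_union = 0, 0
    for pred, gt in zip(pred_masks, gt_masks):
        pred, gt = pred.astype(bool), gt.astype(bool)
        total_inter += np.logical_and(pred, gt).sum()
        total_union += np.logical_or(pred, gt).sum()
    return total_inter / max(total_union, 1)  # guard against an empty split
```

Because large objects dominate the accumulated union, cIoU can differ noticeably from a per-image mean IoU computed on the same predictions.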
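
The Experiment Setup row states the stage-1 objective only as a sum, $\mathcal{L}_{s1} = \mathcal{L}_{\text{text}} + \mathcal{L}_{\text{det}} + \mathcal{L}_{\text{cyc}}$, together with the batch size, learning rate, and step count. The sketch below is an assumption-laden illustration of how such a composite loss could be wired up in PyTorch: the cross-entropy text term is standard, but the concrete forms of the detection and cycle terms (plain L1 here) and the unit weights are guesses, not details reported in the paper.

```python
import torch
import torch.nn.functional as F

def stage1_loss(text_logits, text_labels, box_pred, box_gt, box_cycle):
    """Composite stage-1 objective: L_s1 = L_text + L_det + L_cyc.
    Term definitions below are illustrative assumptions, not the paper's."""
    # Language-modeling loss over next-token predictions (padding masked with -100).
    l_text = F.cross_entropy(
        text_logits.view(-1, text_logits.size(-1)),
        text_labels.view(-1),
        ignore_index=-100,
    )
    # Box regression loss against ground-truth coordinates (assumed L1).
    l_det = F.l1_loss(box_pred, box_gt)
    # Cycle-consistency loss between re-encoded/decoded boxes and the prediction (assumed L1).
    l_cyc = F.l1_loss(box_cycle, box_pred.detach())
    return l_text + l_det + l_cyc

# Reported stage-1 hyperparameters: batch size 64, learning rate 2e-5, 65k steps,
# with the image encoder frozen and the LLM, box encoder, and decoder trained.
```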