NExT-Chat: An LMM for Chat, Detection and Segmentation

Authors: Ao Zhang, Yuan Yao, Wei Ji, Zhiyuan Liu, Tat-Seng Chua

ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Comprehensive experiments show the effectiveness of our NExT-Chat on various tasks, e.g., NExT-Chat (87.7) vs. Shikra (86.9) on POPE-Random, NExT-Chat (71.3) vs. LISA (67.9) on referring expression segmentation task, and NExT-Chat (79.6) vs. Kosmos-2 (62.3) on region caption task.
Researcher Affiliation | Academia | 1National University of Singapore, 2Tsinghua University.
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide concrete access to source code for the methodology described.
Open Datasets | Yes | Stage-1. In stage-1, we perform pre-training using a mixture of data from various sources, including Flickr30K Entities (Plummer et al., 2015), Visual Genome (Krishna et al., 2017), RefCOCO (Yu et al., 2016), RefCOCO+ (Yu et al., 2016), RefCOCOg (Mao et al., 2016), VQAv2 (Antol et al., 2015), PointQA (Mani et al., 2020), Visual7W (Zhu et al., 2016), and VCR (Zellers et al., 2019).
Dataset Splits | Yes | To rigorously assess our model's proficiency in generating segmentation masks guided by natural language instructions, we use the referring expression segmentation (RES) splits of RefCOCO, RefCOCO+, and RefCOCOg. As for baselines, we choose both the LMM-based methods (LISA (Lai et al., 2023) and GLaMM (Rasheed et al., 2023)) and non-LMM-based methods including MCN (Luo et al., 2020), VLT (Ding et al., 2021), CRIS (Wang et al., 2022b), LAVT (Yang et al., 2022b), GRES (Liu et al., 2023a), X-Decoder (Zou et al., 2023a), SEEM (Zou et al., 2023b) and PolyFormer (B/L) (Liu et al., 2023d). The cIoU metric is employed to evaluate different methods (see the cIoU sketch after this table).
Hardware Specification | Yes | For the NExT-Chat 7B model, the stage-1 training takes 8 A100 (80G) GPUs for around 59 hours.
Software Dependencies | No | The paper mentions using specific models like 'CLIP ViT-L/14@336px', 'Vicuna-1.5 model', and 'SAM' but does not specify software dependencies with version numbers (e.g., PyTorch version, Python version, specific library versions).
Experiment Setup | Yes | The model is trained with a batch size of 64 and a learning rate of 2e-5 for 65k steps. During this pre-training stage, the entire language model, box encoder, and decoder are trained while keeping the image encoder frozen. The training loss is formulated as: L_s1 = L_text + L_det + L_cyc. ... For the NExT-Chat 7B model, the stage-1 training takes 8 A100 (80G) GPUs for around 59 hours.
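
The stage-1 objective quoted above combines a text loss, a detection loss, and a cycle loss. Below is a minimal PyTorch-style sketch of that combination, assuming next-token cross-entropy for L_text and plain L1 regression for the box terms; the function name and the exact regression losses are illustrative assumptions, not taken from the paper or its released code.

```python
import torch.nn.functional as F

def stage1_loss(text_logits, text_labels, pred_boxes, gt_boxes, cycle_boxes):
    """Hedged sketch of L_s1 = L_text + L_det + L_cyc (names are assumptions)."""
    # Language-modeling loss over the answer tokens.
    l_text = F.cross_entropy(
        text_logits.view(-1, text_logits.size(-1)),
        text_labels.view(-1),
        ignore_index=-100,
    )
    # Detection loss on boxes decoded from location embeddings
    # (plain L1 here; the paper may use a different regression loss).
    l_det = F.l1_loss(pred_boxes, gt_boxes)
    # Cycle loss: re-encoding and decoding a predicted box should reproduce it.
    l_cyc = F.l1_loss(cycle_boxes, pred_boxes.detach())
    return l_text + l_det + l_cyc
```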
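
For the RES evaluation mentioned in the Dataset Splits row, cIoU (cumulative IoU) is typically computed by accumulating intersection and union pixel counts over the whole split before taking the ratio, rather than averaging per-image IoU. A minimal sketch, assuming binary NumPy masks, is shown below.

```python
import numpy as np

def cumulative_iou(pred_masks, gt_masks):
    """cIoU over a split: total intersection / total union across all samples."""
    inter, union = 0, 0
    for pred, gt in zip(pred_masks, gt_masks):
        pred, gt = pred.astype(bool), gt.astype(bool)
        inter += np.logical_and(pred, gt).sum()
        union += np.logical_or(pred, gt).sum()
    return inter / union if union > 0 else 0.0
```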