NExT-Chat: An LMM for Chat, Detection and Segmentation

Authors: Ao Zhang, Yuan Yao, Wei Ji, Zhiyuan Liu, Tat-Seng Chua

ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Comprehensive experiments show the effectiveness of our NExT-Chat on various tasks, e.g., NExT-Chat (87.7) vs. Shikra (86.9) on POPE-Random, NExT-Chat (71.3) vs. LISA (67.9) on referring expression segmentation task, and NExT-Chat (79.6) vs. Kosmos-2 (62.3) on region caption task.
Researcher Affiliation | Academia | 1National University of Singapore, 2Tsinghua University.
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide concrete access to source code for the methodology described.
Open Datasets | Yes | Stage-1. In stage-1, we perform pre-training using a mixture of data from various sources, including Flickr30K Entities (Plummer et al., 2015), Visual Genome (Krishna et al., 2017), RefCOCO (Yu et al., 2016), RefCOCO+ (Yu et al., 2016), RefCOCOg (Mao et al., 2016), VQAv2 (Antol et al., 2015), PointQA (Mani et al., 2020), Visual7W (Zhu et al., 2016), and VCR (Zellers et al., 2019).
Dataset Splits | Yes | To rigorously assess our model's proficiency in generating segmentation masks guided by natural language instructions, we use the referring expression segmentation (RES) splits of RefCOCO, RefCOCO+, and RefCOCOg. As for baselines, we choose both the LMM-based methods (LISA (Lai et al., 2023) and GLaMM (Rasheed et al., 2023)) and non-LMM-based methods including MCN (Luo et al., 2020), VLT (Ding et al., 2021), CRIS (Wang et al., 2022b), LAVT (Yang et al., 2022b), GRES (Liu et al., 2023a), X-Decoder (Zou et al., 2023a), SEEM (Zou et al., 2023b) and PolyFormer (B/L) (Liu et al., 2023d). The cIoU metric is employed to evaluate different methods (see the cIoU sketch after this table).
Hardware Specification | Yes | For the NExT-Chat 7B model, the stage-1 training takes 8 A100 (80G) GPUs for around 59 hours.
Software Dependencies | No | The paper mentions using specific models like 'CLIP ViT-L/14@336px', 'Vicuna-1.5 model', and 'SAM' but does not specify software dependencies with version numbers (e.g., PyTorch version, Python version, specific library versions).
Experiment Setup | Yes | The model is trained with a batch size of 64 and a learning rate of 2e-5 for 65k steps. During this pre-training stage, the entire language model, box encoder, and decoder are trained while keeping the image encoder frozen. The training loss is formulated as: L_s1 = L_text + L_det + L_cyc. ... For the NExT-Chat 7B model, the stage-1 training takes 8 A100 (80G) GPUs for around 59 hours.
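
The stage-1 objective quoted above combines a text loss, a detection loss, and a cycle loss. Below is a minimal PyTorch-style sketch of that combination, assuming next-token cross-entropy for L_text and plain L1 regression for the box terms; the function name and the exact regression losses are illustrative assumptions, not taken from the paper or its released code.

```python
import torch.nn.functional as F

def stage1_loss(text_logits, text_labels, pred_boxes, gt_boxes, cycle_boxes):
    """Hedged sketch of L_s1 = L_text + L_det + L_cyc (names are assumptions)."""
    # Language-modeling loss over the answer tokens.
    l_text = F.cross_entropy(
        text_logits.view(-1, text_logits.size(-1)),
        text_labels.view(-1),
        ignore_index=-100,
    )
    # Detection loss on boxes decoded from location embeddings
    # (plain L1 here; the paper may use a different regression loss).
    l_det = F.l1_loss(pred_boxes, gt_boxes)
    # Cycle loss: re-encoding and decoding a predicted box should reproduce it.
    l_cyc = F.l1_loss(cycle_boxes, pred_boxes.detach())
    return l_text + l_det + l_cyc
```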
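
For the RES evaluation mentioned in the Dataset Splits row, cIoU (cumulative IoU) is typically computed by accumulating intersection and union pixel counts over the whole split before taking the ratio, rather than averaging per-image IoU. A minimal sketch, assuming binary NumPy masks, is shown below.

```python
import numpy as np

def cumulative_iou(pred_masks, gt_masks):
    """cIoU over a split: total intersection / total union across all samples."""
    inter, union = 0, 0
    for pred, gt in zip(pred_masks, gt_masks):
        pred, gt = pred.astype(bool), gt.astype(bool)
        inter += np.logical_and(pred, gt).sum()
        union += np.logical_or(pred, gt).sum()
    return inter / union if union > 0 else 0.0
```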