Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026.
NExT-Chat: An LMM for Chat, Detection and Segmentation
Authors: Ao Zhang, Yuan Yao, Wei Ji, Zhiyuan Liu, Tat-Seng Chua
ICML 2024 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Comprehensive experiments show the effectiveness of our NExT-Chat on various tasks, e.g., NExT-Chat (87.7) vs. Shikra (86.9) on POPE-Random, NExT-Chat (71.3) vs. LISA (67.9) on referring expression segmentation task, and NExT-Chat (79.6) vs. Kosmos-2 (62.3) on region caption task. |
| Researcher Affiliation | Academia | 1National University of Singapore 2Tsinghua University. |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide concrete access to source code for the methodology described. |
| Open Datasets | Yes | Stage-1. In stage-1, we perform pre-training using a mixture of data from various sources, including Flickr30K Entities (Plummer et al., 2015), Visual Genome (Krishna et al., 2017), RefCOCO (Yu et al., 2016), RefCOCO+ (Yu et al., 2016), RefCOCOg (Mao et al., 2016), VQAv2 (Antol et al., 2015), PointQA (Mani et al., 2020), Visual7W (Zhu et al., 2016), and VCR (Zellers et al., 2019). |
| Dataset Splits | Yes | To rigorously assess our model's proficiency in generating segmentation masks guided by natural language instructions, we use the referring expression segmentation (RES) splits of RefCOCO, RefCOCO+, and RefCOCOg. As for baselines, we choose both the LMM based methods (LISA (Lai et al., 2023) and GLaMM (Rasheed et al., 2023)) and non-LMM based methods including MCN (Luo et al., 2020), VLT (Ding et al., 2021), CRIS (Wang et al., 2022b), LAVT (Yang et al., 2022b), GRES (Liu et al., 2023a), X-Decoder (Zou et al., 2023a), SEEM (Zou et al., 2023b) and PolyFormer (B/L) (Liu et al., 2023d). cIoU metric is employed to evaluate different methods. |
| Hardware Specification | Yes | For the NExT-Chat 7B model, the stage-1 training takes 8 A100 (80G) GPUs for around 59 hours. |
| Software Dependencies | No | The paper mentions using specific models like 'CLIP ViT-L/14@336px', 'Vicuna-1.5 model', and 'SAM' but does not specify software dependencies with version numbers (e.g., PyTorch version, Python version, specific library versions). |
| Experiment Setup | Yes | The model is trained with a batch size of 64 and a learning rate of 2e-5 for 65k steps. During this pre-training stage, the entire language model, box encoder, and decoder are trained while keeping the image encoder frozen. The training loss is formulated as: $L_{s1} = L_{text} + L_{det} + L_{cyc}$. ... For the NExT-Chat 7B model, the stage-1 training takes 8 A100 (80G) GPUs for around 59 hours. |
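
The stage-1 figures quoted in the Experiment Setup and Hardware Specification rows can be turned into rough totals. The following is a minimal sketch, not the authors' code: the class `Stage1Setup` and the function `stage1_loss` are hypothetical names introduced here, the batch size of 64 is assumed to be the global batch size, and "around 59 hours" is treated as exactly 59 for the arithmetic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage1Setup:
    """Hypothetical container for the stage-1 values quoted in the table."""
    batch_size: int = 64             # assumed to be the global batch size
    learning_rate: float = 2e-5
    train_steps: int = 65_000        # "65k steps"
    num_gpus: int = 8                # A100 (80G)
    wall_clock_hours: float = 59.0   # "around 59 hours", treated as exact

    def samples_seen(self) -> int:
        # Training examples processed = batch size x optimizer steps.
        return self.batch_size * self.train_steps

    def gpu_hours(self) -> float:
        # Approximate compute budget = GPUs x wall-clock hours.
        return self.num_gpus * self.wall_clock_hours


def stage1_loss(l_text: float, l_det: float, l_cyc: float) -> float:
    # Unweighted sum of the three terms, as quoted: Ls1 = Ltext + Ldet + Lcyc.
    return l_text + l_det + l_cyc


setup = Stage1Setup()
print(setup.samples_seen())  # 4160000
print(setup.gpu_hours())     # 472.0
```

Under these assumptions, stage-1 processes about 4.16 million training samples in roughly 472 A100 GPU-hours. Both totals are estimates derived from the quoted values, not numbers reported in the paper.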