Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

ARGenSeg: Image Segmentation with Autoregressive Image Generation Model

Authors: Xiaolong Wang, Lixiang Ru, Ziyuan Huang, Kaixiang Ji, DanDan Zheng, Jingdong Chen, Jun Zhou

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental Extensive experiments demonstrate that our method surpasses prior state-of-the-art approaches on multiple segmentation datasets with a remarkable boost in inference speed, while maintaining strong understanding capabilities. 4 Experiments 4.1 Experimental Setup 4.2 Referring Segmentation 4.3 Multimodal Understanding 4.4 Function Extension 4.5 Efficiency Analysis 4.6 Ablation Study
Researcher Affiliation Industry Xiaolong Wang, Lixiang Ru, Ziyuan Huang, Kaixiang Ji, Dandan Zheng, Jingdong Chen, Jun Zhou Ant Group EMAIL EMAIL
Pseudocode No The paper describes the architecture and procedures in descriptive text and diagrams, but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code Yes Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [Yes] Justification: The data and the underlying model used in this article have provided detailed information and are publicly accessible.
Open Datasets Yes Datasets: As described in Sec. 3.3, we perform a single-stage supervised finetuning to jointly train on both image segmentation and multimodal understanding data. Details of all datasets used are provided in Appendix A. The training of ARGenSeg relies entirely on publicly available external datasets. Table 7: Training data used in our experiments. Image Segmentation: ADE20K (20K) [76], COCO-Panoptic (118K) [31], gRefCOCO (79K) [32], RefCOCO/+/g (127K) [37, 70], LISA++ Inst.Seg (58K) [69]. Multimodal Understanding: AI2D [25], ChartQA [38], COCO-Text [55], DocVQA [18], LLaVA-150K [34], GQA [23], DVQA [24], OCR-VQA [39], TextVQA [47], SynthDoG-EN [26], InternVL-SA1B-Caption [14], Visual Genome [28], GeoQA+ [10]. Image Generation: ImageNet-Instruct-class [78], ImageNet-Instruct1270K [78].
Dataset Splits Yes Table 1: Performance comparison with state-of-the-art methods on three referring image segmentation benchmarks using cIoU. ... RefCOCO RefCOCO+ RefCOCOg val testA testB val testA testB val test ... 4.2 Referring Segmentation: We evaluate our approach on standard RES benchmarks RefCOCO/+/g [37, 70]. Following prior works [29, 57], we assess two versions of our model: one trained on the mixed dataset, and another further finetuned on the in-domain training sets of RefCOCO/+/g.
Hardware Specification Yes 4.5 Efficiency Analysis: We compare ARGenSeg with previous autoregressive generation models and MLLM-based segmentation methods in terms of inference time required to generate a 256 × 256 image or mask. All experiments are conducted using official implementations on an NVIDIA A100 GPU.
Software Dependencies No The paper mentions using InternVL 2.5 [13] as the MLLM backbone, VAR [50] for the visual tokenizer, and the AdamW [36] optimizer, but it does not specify software versions for programming languages, libraries, or frameworks (e.g., Python, PyTorch, CUDA versions).
Experiment Setup Yes 4.1 Implementation Details: Our model accepts input images of arbitrary resolutions, while the output images are generated at the resolution of 256 × 256. The image tokenizer uses a downsampling ratio l = 16, with a feature dimension D = 32 and a visual codebook size V = 4096. The model operates with K = 10 scales. During training, we use the AdamW [36] optimizer with a maximum learning rate of 4 × 10⁻⁵ and employ cosine learning rate scheduling. The batch size is set to 128. ... For image generation, we further finetune the pre-trained ARGenSeg model using image generation data to unlock its text-to-image generation capabilities. ... We finetune ARGenSeg on 1.28M class-based samples from the ImageNet-Instruct-class dataset [78], using a batch size of 512 for 20k iterations. This successfully enables class-conditional image generation, as illustrated in Fig. 1. We then continue training for an additional 30k iterations with a batch size of 256 on the ImageNet-Instruct1270K dataset [78], which is based on instruction-conditioned generation.
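
The numbers quoted in the Experiment Setup row can be turned into a small sanity-check script. The following is a minimal sketch, not the authors' training code. It assumes a plain cosine decay with no warmup and a minimum learning rate of 0 (neither is stated in the quoted text), and it assumes the finest token grid equals the output resolution divided by the downsampling ratio. The function names cosine_lr, finest_token_grid, and samples_seen are hypothetical helpers introduced only for illustration.

```python
import math

# Values quoted in the Experiment Setup row.
MAX_LR = 4e-5        # maximum learning rate
IMAGE_SIZE = 256     # output resolution (256 x 256)
DOWNSAMPLE = 16      # tokenizer downsampling ratio l


def cosine_lr(step, total_steps, max_lr=MAX_LR, min_lr=0.0):
    """Cosine decay from max_lr to min_lr.

    Assumption: no warmup and min_lr = 0; the quoted text only says
    'cosine learning rate scheduling'.
    """
    progress = min(max(step / total_steps, 0.0), 1.0)
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


def finest_token_grid(image_size=IMAGE_SIZE, downsample=DOWNSAMPLE):
    """Side length and token count of the finest grid.

    Assumption: finest grid side = image_size // downsample.
    """
    side = image_size // downsample
    return side, side * side


def samples_seen(batch_size, iterations):
    """Total training samples processed (batch size times iterations)."""
    return batch_size * iterations


if __name__ == "__main__":
    total = 1000  # hypothetical schedule length; the source does not give one
    for step in (0, total // 2, total):
        print(step, cosine_lr(step, total))
    print(finest_token_grid())
    print(samples_seen(512, 20_000), samples_seen(256, 30_000))
```

Under these assumptions, the two image-generation stages process 512 × 20,000 and 256 × 30,000 samples respectively, which can be compared against the 1.28M-sample dataset size to estimate the number of passes over the data.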