Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
Seg-VAR: Image Segmentation with Visual Autoregressive Modeling
Authors: Rongkun Zheng, Lu Qi, Xi Chen, Yi Wang, Kun Wang, Hengshuang Zhao
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments show Seg-VAR outperforms previous discriminative and generative methods on various segmentation tasks and validation benchmarks. By framing segmentation as a sequential hierarchical prediction task, Seg-VAR opens new avenues for integrating autoregressive reasoning into spatial-aware vision systems. We conduct extensive experimental evaluations on challenging image segmentation benchmarks, including COCO, Cityscapes, and ADE20K, and the achieved state-of-the-art results demonstrate the effectiveness and generality of the proposed approach and shed new light on the autoregressive modeling segmentation strategy. |
| Researcher Affiliation | Collaboration | Rongkun Zheng¹, Lu Qi², Xi Chen¹, Yi Wang³,⁴, Kun Wang⁵, Hengshuang Zhao¹. ¹The University of Hong Kong, ²Insta360, ³Shanghai Artificial Intelligence Laboratory, ⁴Shanghai Innovation Institute, ⁵SenseTime Research |
| Pseudocode | No | The paper describes the methodology using textual explanations and architectural diagrams (Figures 2, 3, 4) but does not include explicit pseudocode or algorithm blocks. |
| Open Source Code | No | Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [No] Justification: We will disclose the code after submission and acceptance. |
| Open Datasets | Yes | Datasets. We study Seg-VAR using four widely used image segmentation datasets that support semantic, instance and panoptic segmentation: COCO [44] (80 things and 53 stuff categories), ADE20K [80] (100 things and 50 stuff categories), and Cityscapes [18] (8 things and 11 stuff categories). |
| Dataset Splits | Yes | Datasets. We study Seg-VAR using four widely used image segmentation datasets that support semantic, instance and panoptic segmentation: COCO [44] (80 things and 53 stuff categories), ADE20K [80] (100 things and 50 stuff categories), and Cityscapes [18] (8 things and 11 stuff categories). ... Panoptic and instance segmentation... We use the standard Mask R-CNN inference setting... Semantic segmentation. We follow the same settings as [11] to train our models... The tables mention "COCO panoptic val2017", "COCO val2017", "Cityscapes val split", and "ADE20K val split", indicating the use of standard validation splits. |
| Hardware Specification | Yes | We operate all experiments with 8 V100 GPUs. ... Frames-per-second (fps) is measured on a V100 GPU with a batch size of 1... |
| Software Dependencies | No | We use Detectron2 [70] and follow the updated Mask R-CNN [27] baseline settings for the COCO dataset. (No specific version numbers are provided for Detectron2 or any other software dependencies). |
| Experiment Setup | Yes | We use AdamW [47] optimizer and the step learning rate schedule. We use an initial learning rate of 0.0001 and a weight decay of 0.05 for all backbones. A learning rate multiplier of 0.1 is applied to the backbone and we decay the learning rate at 0.9 and 0.95 fractions of the total number of training steps by a factor of 10. ... For data augmentation, we use the large-scale jittering (LSJ) augmentation [26, 20] with a random scale sampled from the range 0.1 to 2.0 followed by a fixed size crop to 1024 × 1024. ... For inference, we utilize top-k top-p sampling with k=900 and p=0.96 for encoding and decoding the seglat. |
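
The step learning rate schedule quoted in the Experiment Setup row can be sketched in a few lines of plain Python. This is only an illustration of the stated hyperparameters (initial rate 0.0001, backbone multiplier 0.1, decay by a factor of 10 at 0.9 and 0.95 of training), not the authors' code, which has not been released. The function name `step_lr` and the `total_steps` argument are hypothetical, and the excerpt does not state the total number of training steps.

```python
def step_lr(step, total_steps, base_lr=1e-4, backbone_mult=0.1,
            milestones=(0.9, 0.95), gamma=0.1):
    """Return (head_lr, backbone_lr) for a given training step.

    Hypothetical helper: the learning rate is multiplied by `gamma`
    each time training passes one of the fractional `milestones`.
    """
    if not 0 <= step < total_steps:
        raise ValueError("step must be in [0, total_steps)")
    # Count how many decay points have already been reached.
    n_decays = sum(step >= round(m * total_steps) for m in milestones)
    head_lr = base_lr * gamma ** n_decays
    # The backbone trains with a fixed fraction of the head learning rate.
    return head_lr, head_lr * backbone_mult


if __name__ == "__main__":
    total = 1000  # assumed value, for illustration only
    for s in (0, 899, 900, 950, 999):
        head, backbone = step_lr(s, total)
        print(f"step {s:4d}: head lr {head:.1e}, backbone lr {backbone:.1e}")
```

In a framework such as PyTorch, the same behaviour would typically come from two optimizer parameter groups (backbone and the rest) combined with a milestone-based scheduler; the weight decay of 0.05 would be passed to the optimizer separately.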