Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Native Segmentation Vision Transformers

Authors: Guillem Brasó, Aljoša Ošep, Laura Leal-Taixé

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In Section 4.1, we start with mask-free supervision and study emerging segmentation from image-class (Section 4.1.1) and image-caption (Section 4.1.2) supervision, comparing our model to state-of-the-art zero-shot segmentation methods. In Section 4.2, we train and evaluate our model on standard datasets and benchmarks for semantic (Section 4.2.1) and panoptic (Section 4.2.2) segmentation, comparing our direct segmentation model and backbone as drop-in replacement against state-of-the-art. We analyze our design choices and contributions in Section 4.3.
Researcher Affiliation | Industry | Guillem Brasó Aljoša Ošep Laura Leal-Taixé research.nvidia.com/labs/dvl/projects/native-segmentation
Pseudocode | Yes | Algorithm 1 Grouping layer over an input feature map X for L iterations with sparsity. Algorithm 2 Efficient implementation of sparse Spatial Grouping Layer cross-attention operation
Open Source Code | No | Upon acceptance, we will release all code and pre-trained models. All the datasets used by this paper are publicly available, but we are not releasing code at the time of submission. We are, however, committed to releasing code and trained models upon acceptance.
Open Datasets | Yes | ADE20k [20] and COCO-panoptic [21]; ImageNet-1k and ImageNet-22k [41]; CC3M [45] and CC12M [46] datasets; RedCaps12M dataset [47]; Pascal VOC [49], Pascal Context [50], COCO [51], COCO-Stuff [52], ADE20k [20] and Cityscapes [53]
Dataset Splits | Yes | We train models to classify pixels into 150 semantic classes on ADE20k dataset [20], and, following common practice, report results on the validation set. We train and evaluate models on COCO-panoptic [21], which consists of 80 object (things) and 53 background (stuff) classes, requiring models to predict semantic classes and instance IDs for things. fine-tune these models for 30 additional epochs on the ImageNet-1k dataset at 384 × 384 resolution and report top-1 accuracy on ImageNet-1k-val.
Hardware Specification | Yes | Trainings take approximately 36 hours on 8 A100 GPUs. Pre-training runs are conducted on 16 A100 GPUs for approximately a week, and fine-tuning on ImageNet-1k takes approximately 6 hours on 8 A100s. We evaluate the computational efficiency of our method on an NVIDIA A100 GPU with 40GB of VRAM using batch size one and full FP32 precision for our base model variant across multiple input resolutions in Table 8, and at standard 512 × 512 for our Table 9.
Software Dependencies | No | using PyTorch. (Algorithm 2) We leverage the sliding window attention CUDA kernels introduced in the natten [3] library. Our model is implemented with mmsegmentation.
Experiment Setup | Yes | We train our models from scratch for 300 epochs at resolution 224 × 224, following all training hyperparameters, including optimizer, learning rate scheduler, and augmentation setting of [1]. However, we disable MixUp augmentation as it degrades our results. The drop is likely caused by the ambiguity that alpha composite images introduce in our grouping layer. Unlike [3], we do not train for additional cooldown epochs. Following [1, 3], we use stochastic depth for regularization [74], with default survival probabilities of 0.3 and 0.5 for our tiny and base variants, respectively. We train for 20 epochs with initial learning rate 3 × 10⁻⁴ and use a cosine decay scheduler leading to a minimum learning rate of 3 × 10⁻⁵. We apply linear learning rate warmup during the first 3k iterations, and train for a total of 20 epochs with a batch size of 4096 (approx. 68k iterations). As with ImageNet models, we use stochastic depth for regularization [74], with survival probabilities set to 0.2 and 0.3 for our tiny and base model, respectively.
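
The learning-rate schedule quoted in the Experiment Setup response (linear warmup over the first 3k iterations, then cosine decay from 3 × 10⁻⁴ to 3 × 10⁻⁵ across roughly 68k iterations) can be sketched as below. This is a minimal illustration, not the authors' code: the function name `lr_at`, the per-iteration update granularity, the warmup starting near zero, and the cosine phase spanning only the post-warmup steps are all assumptions that the quoted text does not specify.

```python
import math


def lr_at(step, total_steps=68_000, warmup_steps=3_000,
          base_lr=3e-4, min_lr=3e-5):
    """Hypothetical helper: learning rate at a given training iteration.

    The numeric defaults come from the quoted setup. The shape of the
    warmup (starting near zero) and the choice to run the cosine only
    over the post-warmup steps are assumptions.
    """
    if step < warmup_steps:
        # Linear warmup: ramp up to base_lr over the first warmup_steps.
        return base_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    # Cosine decay from base_lr (progress = 0) down to min_lr (progress = 1).
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    for s in (0, 1_500, 3_000, 35_000, 68_000):
        print(s, lr_at(s))
```

Writing the schedule as a pure function of the step makes it easy to plot or to wrap in a framework scheduler. The paper's actual implementation may differ in the details labeled above as assumptions.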