Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

UniPixel: Unified Object Referring and Segmentation for Pixel-Level Visual Reasoning

Authors: Ye Liu, Zongyang Ma, Junfu Pu, Zhongang Qi, Yang Wu, Ying Shan, Chang Chen

NeurIPS 2025

Reproducibility Variable Result LLM Response
Research Type Experimental We evaluate the effectiveness of UniPixel by conducting extensive experiments across a diverse set of benchmarks. Specifically, we study the following research questions. Q1. Whether UniPixel is flexible and effective on basic image/video referring and segmentation tasks compared to the corresponding representative methods? Q2. How does it perform on the more challenging PixelQA task, which requires joint referring, segmentation, and question answering in videos? Q3. What effects does each architectural design contribute? More importantly, does the unified modeling of referring and segmentation lead to a mutual reinforcement effect? Detailed information about the benchmarks, evaluation metrics, implementation details, and more experimental results can be found in the appendix.
Researcher Affiliation Collaboration (1) The Hong Kong Polytechnic University; (2) ARC Lab, Tencent PCG; (3) Institute of Automation, Chinese Academy of Sciences; (4) vivo Mobile Communication Co.; (5) Mind Wingman Technology (Shenzhen) Co., Ltd.
Pseudocode No The paper describes methods and processes through textual descriptions and architectural diagrams (e.g., Figure 3: The architecture of UniPixel), but it does not contain any explicitly labeled pseudocode or algorithm blocks.
Open Source Code Yes We open-source all the code, checkpoints, data, and training logs to ensure full reproducibility.
Open Datasets Yes The datasets are listed in Tab. 12. In the first stage, we pre-train the sparse prompt encoder using 851K regional captioning data. Then, we align the LLM and mask decoder by training the L→M projector on 87K referring segmentation data. In the last stage, we further unfreeze the M→L projector and mask decoder, and apply LoRA [26] on the visual encoder and LLM. The model is jointly trained on a large-scale corpus with around 1M samples for diverse tasks. The detailed distribution of training datasets for UniPixel is shown in Tab. 12. Within the three-stage training recipe, we first pre-train the sparse prompt encoder using short caption samples from Inst-IT [61] and VideoRefer [103].
Dataset Splits Yes We evaluate the effectiveness of UniPixel from two aspects, i.e., basic referring/segmentation capabilities and flexible pixel-level reasoning capabilities. For the first aspect, we conduct extensive experiments on 10 public benchmarks across 9 image/video referring/segmentation tasks. Our method achieves state-of-the-art performance in diverse scenarios. Notably, on the challenging video reasoning segmentation and referred video QA tasks, our 3B model obtains 62.1 J&F on ReVOS [96] and 72.8% Acc on VideoRefer-Bench-Q [103], surpassing strong counterparts with 7B to 13B parameters.
Hardware Specification Yes We train the model with 8 RTX A6000 Ada (48G) GPUs, with a global batch size of 256 for stages 1 and 2, and 32 for stage 3.
Software Dependencies Yes We instantiate our base models with 3B and 7B versions of Qwen2.5-VL [3]. Both variants employ pre-trained SAM 2.1 [66] with Hiera Base+ [70] backbone as the mask decoder.
Experiment Setup Yes We instantiate our base models with 3B and 7B versions of Qwen2.5-VL [3]. Both variants employ pre-trained SAM 2.1 [66] with Hiera Base+ [70] backbone as the mask decoder. The M→L projector is initialized with the weights from the V→L projector of Qwen2.5-VL. The hidden size inside the prompt encoder is 256. To reduce GPU memory and accelerate training, we randomly sample 8 frames per video, with each frame resized to 316² to 448² pixels (128 to 256 tokens per frame). The frame sampling strategies follow the specifications of each benchmark during inference. The mask decoder has a fixed resolution of 768 × 768. For each segmentation sample, up to 5 objects are randomly selected to compute the mask prediction losses. During training, LoRA adapters [26] with rank=128 and alpha=256 are applied to all QKVO layers in the visual encoder and LLM. The input sequences are restricted to 4K tokens. We train the model with 8 RTX A6000 Ada (48G) GPUs, with a global batch size of 256 for stages 1 and 2, and 32 for stage 3. In the first two stages, the learning rates are set to 1e-3. In the last stage, it is set to 5e-6 for the mask decoder and 2e-5 for all the other parameters, respectively. A linear warmup in the first 3% steps followed by cosine decay is adopted in all stages. The training loss for UniPixel is a linear combination of language modeling loss and mask decoding losses [66], including a focal loss and dice loss for mask prediction, a mean-absolute-error (MAE) loss for IoU prediction, and a cross-entropy loss for objectness prediction. The loss weights are set to 1, 100, 5, 5, and 5, respectively.
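
The schedule and loss weighting quoted in the Experiment Setup row can be sketched in a few lines of Python. This is an illustrative sketch only, not the authors' code: the function names (`lr_at_step`, `total_loss`), the dictionary keys, the decay floor of zero, and the exact warmup shape are assumptions made here. The quoted text fixes only the 3% warmup fraction, the cosine decay, the peak learning rates, and the five loss weights.

```python
import math

# Loss weights as quoted, in the order the losses are listed:
# language modeling, focal, dice, MAE (IoU prediction), cross-entropy (objectness).
# The key names are hypothetical labels chosen for this sketch.
LOSS_WEIGHTS = {
    "lm": 1.0,
    "focal": 100.0,
    "dice": 5.0,
    "iou_mae": 5.0,
    "objectness_ce": 5.0,
}


def lr_at_step(step, total_steps, peak_lr, warmup_ratio=0.03):
    """Linear warmup over the first `warmup_ratio` of steps, then cosine decay.

    Assumption: the rate decays to 0 at `total_steps`; the quoted text does
    not state a minimum learning rate.
    """
    warmup_steps = max(1, round(total_steps * warmup_ratio))
    if step < warmup_steps:
        # Ramp linearly so the last warmup step reaches peak_lr.
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


def total_loss(losses):
    """Weighted sum (linear combination) of the individual loss values."""
    return sum(weight * losses[name] for name, weight in LOSS_WEIGHTS.items())


if __name__ == "__main__":
    # Stage 1/2 peak rate from the quoted setup; 1000 steps is an arbitrary example.
    for step in (0, 29, 500, 1000):
        print(step, lr_at_step(step, total_steps=1000, peak_lr=1e-3))
    print(total_loss({name: 1.0 for name in LOSS_WEIGHTS}))
```

For stage 3, the same schedule would be evaluated twice with different peaks (5e-6 for the mask decoder, 2e-5 for the remaining parameters), for example by placing the two parameter groups in separate optimizer groups.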