Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.

UFO: A Unified Approach to Fine-grained Visual Perception via Open-ended Language Interface

Authors: Hao Tang, Chen-Wei Xie, Haiyang Wang, Xiaoyi Bao, Tingyu Weng, Pandeng Li, Yun Zheng, Liwei Wang

NeurIPS 2025

Reproducibility Variable Result LLM Response
Research Type Experimental After multi-task training on five standard visual perception datasets, UFO outperforms the previous state-of-the-art generalist models by 12.3 mAP on COCO instance segmentation and 3.3 mIoU on ADE20K semantic segmentation. Furthermore, our method seamlessly integrates with existing MLLMs, effectively combining fine-grained perception capabilities with their advanced language abilities, thereby enabling more challenging tasks such as reasoning segmentation. Code and models are available at https://github.com/nnnth/UFO. (Abstract) 5 Experiments (Section Title) 5.4 Ablation Study (Section Title)
Researcher Affiliation Collaboration 1 Center for Data Science, Peking University; 2 Alibaba Group; 3 CASIA; 4 Center for Machine Learning Research, Peking University; 5 State Key Laboratory of General Artificial Intelligence, Peking University, Beijing, China. {tanghao@stu, wanghaiyang@stu, wanglw@cis}.pku.edu.cn EMAIL EMAIL
Pseudocode No The paper describes methods in paragraph text and provides diagrams (e.g., Figure 2) but does not include any explicitly labeled "Pseudocode" or "Algorithm" blocks.
Open Source Code Yes Code and models are available at https://github.com/nnnth/UFO.
Open Datasets Yes After multi-task training on five standard visual perception datasets, UFO outperforms the previous state-of-the-art generalist models by 12.3 mAP on COCO instance segmentation and 3.3 mIoU on ADE20K semantic segmentation. Datasets. We use the same multi-task dataset as GiT: COCO 2017 [42] for object detection and instance segmentation, COCO Caption [10] for image captioning, the RefCOCO series [48, 83] for referring expression comprehension (REC), and ADE20K [90] for semantic segmentation.
Dataset Splits Yes This involves jointly training on a mixed dataset of the five tasks and directly testing on their respective validation or test sets.
Hardware Specification Yes In training, we use a batch size of 32 with gradient accumulation set to 16, running on 8 NVIDIA A100 GPUs for 120K iterations. Table 9: Speed is measured on UFO-ViT-B, single A100 with batch size 1. Table 11: Speed is measured on an A100 GPU with batch size 1. Table 12: GPUs: 24 V100 | 8 A100 | 8 A100
Software Dependencies No The paper mentions specific models and tokenizers (e.g., CLIP, LlamaTokenizer, BertTokenizer, ViT, Vicuna, InternViT, InternLM2.5 Tokenizer) and the AdamW optimizer, but it does not provide specific version numbers for general software dependencies like Python, PyTorch, or CUDA.
Experiment Setup Yes Multi-Task Training Details. To facilitate comparison with specialist models, we also conduct single-task training independently on five selected tasks. For both single-task and multi-task training, we use a batch size of 24 and employ the AdamW [33] optimizer with a cosine annealing schedule, setting the initial learning rate to 0.0002. Fine-grained Instruction Tuning Details. In training, we use a batch size of 32 with gradient accumulation set to 16, running on 8 NVIDIA A100 GPUs for 120K iterations. The AdamW [33] optimizer and a cosine annealing schedule are employed, with a learning rate of 0.0002 and weight decay of 0.01. For efficient training, we employ LoRA [27] with a rank of 8, freezing the image tokenizer while keeping only the LLM trainable. Table 12: Multi-task training and instruction tuning settings. Columns: config | Multi-task (ViT) | Multi-task (MLLM) | Instruction tuning. Rows: optimizer: AdamW | AdamW | AdamW; learning rate: 2e-4 | 2e-4 | 2e-4; weight decay: 0.05 | 0.01 | 0.01; layer-wise lr decay: 0.85 | 0.85; schedule: cosine | cosine | cosine; gradient norm clip: 0.1 | 1.0 | 1.0; warmup iters: 1k | 1k | 1k; training iters: 640k | 400k | 90k+30k; batch size: 24 | 24 | 32; gradient accumulation: 16; LoRA rank: 8; LoRA alpha: 16; LoRA dropout: 0.05; LoRA modules: LLMs; drop path: 0.1(B), 0.4(L,H); precision: FP16 | BF16 | BF16.
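
To make the quoted schedule concrete, the following is a minimal sketch of a cosine annealing learning-rate curve using the instruction-tuning values reported in Table 12 (learning rate 2e-4, 1k warmup iterations, 120K total iterations). The linear warmup shape, the floor of zero, the treatment of the 90k+30k iterations as one continuous run, and the function name `lr_at` are all assumptions made for illustration; the quoted text does not specify them, and this is not the authors' implementation.

```python
import math

BASE_LR = 2e-4          # learning rate reported in Table 12
WARMUP_ITERS = 1_000    # warmup iters reported in Table 12
TOTAL_ITERS = 120_000   # 90k + 30k iterations, treated here as one run (assumption)


def lr_at(step, base_lr=BASE_LR, warmup=WARMUP_ITERS, total=TOTAL_ITERS, min_lr=0.0):
    """Return the learning rate at a 0-indexed training step.

    Linear warmup and min_lr=0.0 are illustrative assumptions.
    """
    if step < warmup:
        # Linear ramp that reaches base_lr on the last warmup step.
        return base_lr * (step + 1) / warmup
    # Fraction of the post-warmup schedule completed, clamped to [0, 1].
    progress = min((step - warmup) / max(1, total - warmup), 1.0)
    # Cosine annealing from base_lr down to min_lr.
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    for step in (0, 999, 1_000, 60_500, 120_000):
        print(step, lr_at(step))
```

Under these assumptions the rate peaks at 2e-4 when warmup ends, passes through roughly 1e-4 halfway through the decay, and reaches the floor at iteration 120K.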