Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

FOCUS: Unified Vision-Language Modeling for Interactive Editing Driven by Referential Segmentation

Authors: Fan Yang, Yousong Zhu, Xin Li, Yufei Zhan, Hongyin Zhao, Shurong Zheng, Yaowei Wang, Ming Tang, Jinqiao Wang

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments across three core tasks, including multimodal understanding, referring segmentation accuracy, and controllable image generation, demonstrate that FOCUS achieves strong performance by jointly optimizing visual perception and generative capabilities.
Researcher Affiliation | Academia | (1) Foundation Model Research Center, Institute of Automation, Chinese Academy of Sciences, Haidian District, Beijing, China; (2) Peng Cheng Laboratory, Shenzhen, China; (3) School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China; (4) School of Artificial Intelligence, China University of Mining and Technology-Beijing, Beijing, China; (5) Wuhan AI Research, Wuhan, China
Pseudocode | No | The paper describes the methodology in text and illustrates it with figures, such as Figure 3 and Figure 5, which show the framework and training pipeline. However, there are no explicit sections or blocks labeled 'Pseudocode' or 'Algorithm' presenting structured steps in a code-like format.
Open Source Code | Yes | We provide partial core code in the supplemental material to demonstrate the openness and transparency of our method. Full code, pretrained models, and detailed instructions for reproducing the main results will be released after paper acceptance to ensure compliance with anonymity requirements during the review process.
Open Datasets | Yes | We utilize 45M image-text pairs from COYO, EMOVA, and LAION-2B... We train on 30M image-text pairs from COYO, EMOVAPretrain, and LLaVA-150K... We leverage 35M samples from UltraEdit, SEED-Edit, and AnyEdit for text-guided editing. An additional 3M segmentation annotations are sourced from RefCOCO-series, RefClef, and video datasets like DAVIS-2017 and YouTube-VIS2019. For dialogue and question answering, we include 5M samples from Magpie, OpenOrca, SCP-116K, OpenHermes, and OPC-SFT-Stage1.
Dataset Splits | Yes | To evaluate the multimodal understanding capabilities of our model, we conduct systematic evaluations on two categories of widely-used benchmarks, as shown in Table 1: (1) General benchmarks, including POPE, MMBench, SEED, MME-P, MM-Vet, MMMU, and AI2D; and (2) Document-oriented benchmarks, including VQA-text, ChartQA, DocVQA, InfoVQA, and OCRBench. ...We evaluate the referential segmentation performance of FOCUS on four standard benchmarks: RefCOCO, RefCOCO+, RefCOCOg, and gRefCOCO, using mean Intersection-over-Union (mIoU) as the evaluation metric. As shown in Table 3, FOCUS achieves competitive or superior performance... Table 5: The effect of different image resolutions and multi-stage training strategy on model performance across various tasks. ↑ indicates higher is better, ↓ indicates lower is better. Column headers: Image Size; MJHQ30K: FID(↓); GenAI-Bench: Basic(↑), Adv.(↑); Image Understanding: POPE(↑), MMB(↑), SEED(↑); RefCOCO: testA(↑), testB(↑), Valid(↑).
Hardware Specification | Yes | The training of the Dual-Branch Visual Tokenizer and the diffusion decoder each took approximately 3 days on a computing cluster, while the 3B-parameter MLLM required around 13 days to complete the three-stage training process. ...We report the GPU type (e.g., A100), memory size, number of training hours, and cluster environment used for each major experiment in the appendix.
Software Dependencies | No | In FOCUS, we adopt Qwen2.5-3B [71] as the large language model (LLM)... a latent diffusion decoder initialized from SDXL... based on the MoVQGAN [84] architecture... a SimVQ [86] module... The mask decoder is Mask2Former [14]...
Experiment Setup | Yes | We employ the AdamW optimizer without weight decay and use a constant learning rate across the visual encoder, diffusion decoder, and the large vision language model. Detailed training hyperparameters for each component are summarized in the supplementary materials. Table 9: Training hyperparameters across different stages in FOCUS. Settings: Visual Quantizer (Tokenizer) (Image Reconstruction) Projector Warmup (Projector Warmup) Multimodal Pretraining (Seg. Pretrain) Instruction Tuning (Instruction Tuning). Learning Rate: 1e-4 (semantic), 2e-4 (pixel); 2e-5; 1e-3; 2e-5 (Visual encoder, LLM), 1e-3 (Mask Decoder); 2e-5 (Visual encoder, LLM), 2e-6 (Mask Decoder, Diffusion). Batch Size: 256; 128; 512; 128; 256. Training Steps: 136k (pixel), 28k (semantic); 220k; 1 epoch; 3 epochs; 1 epoch. Image Resolution: 256 to 512; 512 / 1024; 256; 256 / 512; 512 to 1024. Frozen Modules: Vanilla encoder Hierarchical Encoder All encoders Codebooks Visual Encoder, LLM Mask Decoder, Diffusion Diffusion Visual Encoder
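The Dataset Splits row names mean Intersection-over-Union (mIoU) as the segmentation metric. The sketch below shows one common way to compute it, averaging the per-sample IoU of binary masks. It is an illustration only, not the authors' evaluation code: the function names, the nested-list mask representation, and the convention that two empty masks score 1.0 are all assumptions, and some referring-segmentation benchmarks report cumulative IoU instead.

```python
from typing import Sequence

# Assumption: a mask is a 2D grid of 0/1 values stored as nested sequences.
Mask = Sequence[Sequence[int]]


def mask_iou(pred: Mask, target: Mask) -> float:
    """Intersection-over-Union of two binary masks with the same shape."""
    intersection = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p and t:
                intersection += 1
            if p or t:
                union += 1
    # Assumption: when both masks are empty, treat them as a perfect match.
    return 1.0 if union == 0 else intersection / union


def mean_iou(preds: Sequence[Mask], targets: Sequence[Mask]) -> float:
    """Average the per-sample IoU over a set of predictions."""
    if not preds or len(preds) != len(targets):
        raise ValueError("preds and targets must be non-empty and equal in length")
    return sum(mask_iou(p, t) for p, t in zip(preds, targets)) / len(preds)
```

For instance, a prediction covering two pixels against a one-pixel target that lies inside it gives an IoU of 1/2, and averaging that with a perfectly matched sample gives 0.75.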
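The Experiment Setup row describes AdamW without weight decay, a constant learning rate, and different learning rates per module. The sketch below shows how such per-module settings could be written down as plain configuration. It assumes that the last learning-rate entry in the flattened Table 9 (2e-5 for the visual encoder and LLM, 2e-6 for the mask decoder and diffusion decoder) belongs to the instruction tuning stage. The class, the module keys, and the helper functions are hypothetical and do not come from the paper's code.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class ParamGroup:
    """One optimizer parameter group (hypothetical structure)."""
    modules: Tuple[str, ...]
    lr: float
    weight_decay: float = 0.0  # the report states AdamW without weight decay


# Assumption: these values are the instruction tuning column of Table 9.
INSTRUCTION_TUNING_GROUPS = (
    ParamGroup(modules=("visual_encoder", "llm"), lr=2e-5),
    ParamGroup(modules=("mask_decoder", "diffusion"), lr=2e-6),
)


def lr_for(module: str, groups: Sequence[ParamGroup]) -> float:
    """Look up the base learning rate assigned to a module name."""
    for group in groups:
        if module in group.modules:
            return group.lr
    raise KeyError(f"no learning rate configured for {module!r}")


def constant_schedule(base_lr: float, step: int) -> float:
    """Constant learning rate: the value does not depend on the step."""
    return base_lr
```

In a framework such as PyTorch, groups like these would typically be translated into the per-parameter option dictionaries passed to the optimizer. That mapping is omitted here because the paper's module structure is not given in this report.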