Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
Omni-R1: Reinforcement Learning for Omnimodal Reasoning via Two-System Collaboration
Authors: Hao Zhong, Muzhi Zhu, Zongze Du, Zheng Huang, Canyu Zhao, Mingyu Liu, Wen Wang, Hao Chen, Chunhua Shen
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on two challenging benchmarks, Referring Audio-Visual Segmentation (RefAVS) and Reasoning Video Object Segmentation (REVOS), show that Omni-R1 not only surpasses strong supervised baselines but also outperforms specialized state-of-the-art models, while substantially improving out-of-domain generalization and mitigating multimodal hallucination. Our results demonstrate the first successful application of RL to large-scale omnimodal reasoning and highlight a scalable path toward universally foundation models. |
| Researcher Affiliation | Collaboration | 1 Zhejiang University, China 2 Ant Group |
| Pseudocode | No | The paper describes the system architecture and methods in detail, including mathematical formulations for rewards, but does not present any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our code is released at: https://github.com/aim-uofa/Omni-R1. |
| Open Datasets | Yes | We benchmark it on two especially demanding tasks, namely Referring Audio-Visual Segmentation (RefAVS [25]) and Reasoning Video Object Segmentation (REVOS [26])... We train System 1 on 1,600 samples randomly selected from the RefAVS [25] dataset and 2,600 videos from the ReVOS [26] and MeViS [46] datasets for 1 epoch. To further enhance the model's fine-grained understanding capabilities as system 2, we additionally train the model on 2,000 images from refCOCOg [47] for one epoch in the style of Seg-Zero [48]. To systematically evaluate this issue, we conducted targeted assessments on audio-related hallucinations using the JUDGE subset of AVHBench [64], the first comprehensive benchmark designed to evaluate the perception and comprehension abilities of audio-visual large language models (LLMs). Omni-R1 achieves an average improvement of +2.0%, +2.7% and +3.7% over baseline on OmniBench [36], VideoMME [59] and MVBench [60] respectively, surpassing all other open-source omni-models. |
| Dataset Splits | Yes | We train System 1 on 1,600 samples randomly selected from the RefAVS [25] dataset and 2,600 videos from the ReVOS [26] and MeViS [46] datasets for 1 epoch. To further enhance the model's fine-grained understanding capabilities as system 2, we additionally train the model on 2,000 images from refCOCOg [47] for one epoch in the style of Seg-Zero [48]. We evaluated the performance of our collaborative system on Ref-AVSBench [25] with other Referring AVS methods. Omni-R1 outperforms previous SOTA EMMC [25] by +4.4% on J&F in seen set and +17.0% on unseen set. For VOS tasks, we adopt a random uniform sampling strategy during training, selecting between 8 and 24 frames per video to enhance temporal diversity and robustness. |
| Hardware Specification | No | The paper does not explicitly mention specific hardware details like GPU/CPU models or memory amounts used for the experiments within the provided text. |
| Software Dependencies | Yes | We adopt Qwen2.5-Omni-7B [7] as our base model... We adopt sam2-hiera-large as our SAM2 [44] version throughout the experiments. ...under the AdamW optimizer... |
| Experiment Setup | Yes | We train System 1 on 1,600 samples randomly selected from the RefAVS [25] dataset and 2,600 videos from the ReVOS [26] and MeViS [46] datasets for 1 epoch. To further enhance the model's fine-grained understanding capabilities as system 2, we additionally train the model on 2,000 images from refCOCOg [47] for one epoch in the style of Seg-Zero [48]. Unless otherwise specified, all experiments are conducted using a policy KL divergence hyperparameter of β = 0.04, a group size of 8, and an initial learning rate of 1 × 10⁻⁶ under the AdamW optimizer with a weight decay of 0.01. We adopt sam2-hiera-large as our SAM2 [44] version throughout the experiments. |
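
The setup values quoted in the table can be gathered into a small configuration sketch. This is an illustration, not the authors' code: the class name, field names, and the `sample_frame_indices` helper are hypothetical, and the frame sampler is only one possible reading of "random uniform sampling" (a uniformly drawn frame count between 8 and 24, then distinct frames drawn uniformly at random). The released repository may implement this differently.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class ReportedSetup:
    """Values quoted in the table above; all field names are hypothetical."""
    base_model: str = "Qwen2.5-Omni-7B"
    sam2_variant: str = "sam2-hiera-large"
    kl_beta: float = 0.04          # policy KL divergence hyperparameter (beta)
    group_size: int = 8            # group size reported for RL training
    learning_rate: float = 1e-6    # initial learning rate (AdamW)
    weight_decay: float = 0.01     # AdamW weight decay
    min_frames: int = 8            # lower bound on frames sampled per video
    max_frames: int = 24           # upper bound on frames sampled per video


def sample_frame_indices(num_frames: int, setup: ReportedSetup,
                         rng: random.Random) -> list[int]:
    """Assumed reading of the frame sampling: draw a frame count uniformly
    from [min_frames, max_frames], cap it at the video length, then draw
    that many distinct frame indices and return them in temporal order."""
    count = rng.randint(setup.min_frames, setup.max_frames)
    count = min(count, num_frames)
    return sorted(rng.sample(range(num_frames), count))
```

For example, `sample_frame_indices(100, ReportedSetup(), random.Random(0))` returns between 8 and 24 sorted, distinct indices in the range 0 to 99. Videos shorter than the drawn count simply return every frame.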