Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
OmniVCus: Feedforward Subject-driven Video Customization with Multimodal Control Conditions
Authors: Yuanhao Cai, He Zhang, Xi Chen, Jinbo Xing, Yiwei Hu, Yuqian Zhou, Kai Zhang, Zhifei Zhang, Soo Ye Kim, Tianyu Wang, Yulun Zhang, Xiaokang Yang, Zhe Lin, Alan L. Yuille
NeurIPS 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments demonstrate that our method significantly surpasses state-of-the-art methods in both quantitative and qualitative evaluations. Project page is at https://caiyuanhao1998.github.io/project/OmniVCus/ [...] 4 Experiment [...] We compare OmniVCus with 4 SOTA methods [...] We also conduct a user study with 37 participants [...] We conduct experiments on single-subject video customization to study the steps of our VideoCus-Factory in Tab. 2a. |
| Researcher Affiliation | Collaboration | Yuanhao Cai1, He Zhang2, Xi Chen3, Jinbo Xing4, Yiwei Hu2, Yuqian Zhou2, Kai Zhang2, Zhifei Zhang2, Soo Ye Kim2, Tianyu Wang2, Yulun Zhang5, Xiaokang Yang5, Zhe Lin2, Alan Yuille1; 1 Johns Hopkins University, 2 Adobe Research, 3 The University of Hong Kong, 4 The Chinese University of Hong Kong, 5 Shanghai Jiao Tong University |
| Pseudocode | No | The paper describes the proposed methods, VideoCus-Factory and OmniVCus, through textual descriptions and diagrams (Figures 3 and 4) rather than explicit pseudocode or algorithm blocks. |
| Open Source Code | No | Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [No] Justification: Code and data are the assets of the company. We will apply for approval to release the code. |
| Open Datasets | Yes | For text-to-multiview, we select 320K samples from Objaverse [69] labeled with long and short text prompts as the training samples. We adopt the OmniEdit [70] as the image instructive editing dataset containing 1.2M data pairs. |
| Dataset Splits | No | In evaluation, we collect 112 samples for single-subject customization and instructive editing customization, 76/74/56 samples for double-/triple-/quadruple-subject customization, and 112 samples for camera-controlled subject-driven video customization. |
| Hardware Specification | Yes | Our model is fine-tuned from a T2V model with 5B parameters for 100K steps in total at a batch size of 356 on 64 A100 GPUs for 5 days. |
| Software Dependencies | No | The paper mentions using the AdamW optimizer but does not specify any software libraries or frameworks with version numbers (e.g., PyTorch, TensorFlow, Python, CUDA). |
| Experiment Setup | Yes | Our model is fine-tuned from a T2V model with 5B parameters for 100K steps in total at a batch size of 356 on 64 A100 GPUs for 5 days. We adopt the AdamW optimizer [71] (β1 = 0.9, β2 = 0.95) with a weight decay of 0.1. The learning rate is linearly warmed up to 1e-5 with 2K iterations and decays to 1e-6 using cosine annealing [72]. The spatial resolution of training images and videos is set to 512×512 for text-to-multiview and image instructive editing and 384×640 for other tasks. The frame number and fps of the training video are set to 64 and 24. |
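
The learning-rate schedule quoted in the Experiment Setup row (linear warmup to 1e-5 over 2K iterations, then cosine annealing to 1e-6 across 100K total steps) can be sketched as below. This is an illustrative reading, not the authors' code, which has not been released. The function name `learning_rate`, the warmup starting from zero, and the cosine phase spanning exactly the remaining 98K steps are all assumptions; the paper excerpt does not state them.

```python
import math

# Values quoted in the Experiment Setup row.
TOTAL_STEPS = 100_000
WARMUP_STEPS = 2_000
PEAK_LR = 1e-5
FINAL_LR = 1e-6


def learning_rate(step: int) -> float:
    """Hypothetical helper: linear warmup followed by cosine annealing."""
    if step < WARMUP_STEPS:
        # Assumption: warmup starts at 0 and reaches the peak at WARMUP_STEPS.
        return PEAK_LR * step / WARMUP_STEPS
    # Assumption: the cosine phase covers all remaining steps.
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    progress = min(progress, 1.0)  # hold the final value past TOTAL_STEPS
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return FINAL_LR + (PEAK_LR - FINAL_LR) * cosine


if __name__ == "__main__":
    # Print the schedule at a few checkpoints.
    for s in (0, 1_000, 2_000, 51_000, 100_000):
        print(f"step {s:>7}: lr = {learning_rate(s):.3e}")
```

Under these assumptions the rate peaks at step 2,000 and reaches the midpoint between the peak and final values halfway through the cosine phase (step 51,000). A different warmup start value or decay horizon would shift the curve, so the sketch should be checked against the paper before reuse.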