Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
OmniGen-AR: AutoRegressive Any-to-Image Generation
Authors: Junke Wang, Xun Wang, Qiushan Guo, Peize Sun, Weilin Huang, Zuxuan Wu, Yu-Gang Jiang
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | 4 Experiments; 4.1 Experimental Setup; 4.2 Comparison with State-of-the-arts; 4.3 Ablation Studies |
| Researcher Affiliation | Collaboration | Junke Wang1,2, Xun Wang3, Qiushan Guo3, Peize Sun4, Weilin Huang3, Zuxuan Wu1,2, Yu-Gang Jiang1,2; 1Institute of Trustworthy Embodied AI, Fudan University; 2Shanghai Collaborative Innovation Center of Intelligent Visual Computing; 3Bytedance Seed; 4The University of Hong Kong |
| Pseudocode | No | The paper describes the model architecture and methods using text, equations, and figures (e.g., Figure 2 showing the model components), but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | We will release the full codebase upon publication, including implementation details and training scripts necessary to reproduce the main experimental results. |
| Open Datasets | Yes | The training of OmniGen-AR includes three stages: 1) single image stage (SI), where we pretrain our model on large-scale image datasets, involving CC3M [62], CC12M [7], Open Images [36], SAM1B [33], and Megalith-huggingface [46]. We also incorporate video datasets, i.e., a 9M subset of Panda70M [11] and HD-VILA-100M [97], and randomly sample 1 frame for each video. ... 3) multi-task stage (MT), where we train our model on a wide range of high-quality datasets, including text-to-image datasets (JourneyDB [65], Synthetic-dataset1M [53], and 10M internal data), image editing datasets (MagicBrush [105], Instruct-Pix2Pix [5], SEED-Edit [19]), depth-to-image datasets (MultiGen-Depth [54]), segmentation-to-image datasets (MultiGen-ADE20k [54] and MultiGen-COCOStuff [54]), and text-to-video datasets (OpenSora-pexels-45k [28], OpenVid-1M [50], and 0.5M high-quality internal data). |
| Dataset Splits | Yes | To evaluate the image generation capability given visual context (image condition), we choose two types of tasks: frame prediction on Kinetics-600 [6] and image editing on Emu-Edit test set [63]. We also evaluate OmniGen-AR on VBench [29] for text-to-video generation. |
| Hardware Specification | Yes | We train our model on 64 A100 GPUs |
| Software Dependencies | Yes | We adopt Qwen2.5 [98] as the text tokenizer and transformer model. While for visual tokenizer, we use an image-video joint tokenizer, i.e., Cosmos-DV8×16×16 [16]. |
| Experiment Setup | Yes | During the SI and IV stages, we train our model on 512 resolution, and the learning rate is set to 1e-4. While for the MT stage, we increase the resolution to 1024 and decrease the learning rate to 2e-5. We train our model on 64 A100 GPUs, the global batch size is 256 for all stages, no warm up or learning rate decay are used. AdamW [45] is employed for optimization. During the IV and MT stages, we replace the standard causal attention mask with a disentangled causal attention mask with a probability of 10%, and similarly drop the text conditions for classifier-free guidance with the same probability. We set CFG scale to 6.0 during inference. |
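
The hyperparameters quoted in the Experiment Setup row can be collected into a small configuration sketch. This is an illustration only, not the authors' code: the names `StageConfig`, `STAGES`, `maybe_drop_text`, and `cfg_logits` are hypothetical, the per-GPU batch size assumes plain data parallelism without gradient accumulation, the text-drop probability for the SI stage is not stated in the quoted text and is assumed to be 0, and the guidance formula is the commonly used classifier-free guidance combination, which the paper may apply differently.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class StageConfig:
    """Hyperparameters for one training stage (hypothetical container)."""
    name: str
    resolution: int
    learning_rate: float
    global_batch_size: int = 256   # quoted: same for all stages
    num_gpus: int = 64             # quoted: 64 A100 GPUs
    text_drop_prob: float = 0.0    # text-condition dropout used for CFG

    @property
    def per_gpu_batch_size(self) -> int:
        # Assumption: pure data parallelism, no gradient accumulation.
        return self.global_batch_size // self.num_gpus


# Values taken from the quoted setup; SI's drop probability is an assumption.
STAGES = {
    "SI": StageConfig("SI", resolution=512, learning_rate=1e-4),
    "IV": StageConfig("IV", resolution=512, learning_rate=1e-4, text_drop_prob=0.1),
    "MT": StageConfig("MT", resolution=1024, learning_rate=2e-5, text_drop_prob=0.1),
}


def maybe_drop_text(text_tokens, drop_prob, rng, null_tokens=()):
    """Replace the text condition with a null condition with probability drop_prob."""
    if rng.random() < drop_prob:
        return list(null_tokens)
    return list(text_tokens)


def cfg_logits(cond, uncond, scale=6.0):
    """Common CFG combination: uncond + scale * (cond - uncond), element-wise."""
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


if __name__ == "__main__":
    rng = random.Random(0)
    for stage in STAGES.values():
        print(stage.name, stage.resolution, stage.learning_rate, stage.per_gpu_batch_size)
    print(maybe_drop_text([101, 102, 103], STAGES["MT"].text_drop_prob, rng))
    print(cfg_logits([1.0, 2.0], [0.0, 2.0]))
```

With a scale of 1.0 the combination reduces to the conditional logits; larger values such as the quoted 6.0 push the prediction further away from the unconditional one.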