Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

XVerse: Consistent Multi-Subject Control of Identity and Semantic Attributes via DiT Modulation

Authors: Bowen Chen, Brynn Zhao, Haomiao Sun, Li Chen, Xu Wang, Daniel Du, Xinglong Wu

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our XVerse method achieves the highest Overall score of 67.76, significantly outperforming all other compared methods. This clearly indicates a strong comprehensive advantage of our approach. In the single-subject generation category, XVerse demonstrates exceptional performance, securing the top AVG score of 71.13. This underscores its robust capability in generating high-quality images focused on individual subjects. Notably, XVerse achieves the best identity similarity score (ID-Sim) of 63.02, suggesting superior preservation of subject identity. While DreamO leads in DPG with 97.51, XVerse's strong average performance, bolstered by competitive scores in IP-Sim (Object Similarity) at 71.35 and AES (Aesthetic Score) at 56.63, highlights its well-rounded excellence. Meanwhile, XVerse truly excels in the more challenging multi-subject generation tasks, achieving a leading AVG score of 64.38.
Researcher Affiliation | Industry | Intelligent Creation Team, ByteDance EMAIL
Pseudocode | No | The paper describes the methods in text and with diagrams (Figure 2, Figure 3), but does not contain any clearly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | Project Page: https://bytedance.github.io/XVerse; Github Link: https://github.com/bytedance/XVerse.
Open Datasets | Yes | We curated a 1M-scale dataset of images with resolutions exceeding 512 pixels from LAION [26]... For general objects, we leveraged the Subject200K [11] dataset. ... To comprehensively assess the model's single-subject and multi-subject conditional generation and editing capabilities, we constructed a new benchmark by merging and curating data from DreamBench++ [31] and Unsplash50 [32]. ... The benchmark used in the paper are all from open-source datasets.
Dataset Splits | No | The paper describes the composition of its training data and the structure of its evaluation benchmark (single/dual/triple-subject scenarios with 300 prompts) but does not provide specific training/validation/test splits (e.g., percentages or exact counts) for the datasets used to train XVerse, which would be needed for exact reproduction.
Hardware Specification | Yes | Training was conducted on 64 NVIDIA A800 GPUs (40GB each)... Due to computational resource limitations, all experiments are executed on 16 NVIDIA A800 (40GB) GPUs.
Software Dependencies | No | The paper mentions key software components like FLUX.1-dev [6] and LoRA [30], but it does not specify version numbers for these or any other software libraries (e.g., PyTorch, Python, CUDA) that would be needed to replicate the experiments.
Experiment Setup | Yes | In XVerse, we design a three-stage training pipeline... We employ LoRA [30] with a rank of 128... The model was trained for 70K, 150K, and 10K iterations in the respective stages. Both the text-stream modulation adapter and LoRA layers were optimized with a learning rate of 5e-6. The region preservation loss was assigned a weight of 10, while the text-image attention loss was weighted at 0.01.
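
For readers attempting a replication, the hyperparameters quoted in the Experiment Setup row can be collected into a single configuration object. The following is a minimal sketch, not the authors' code: the class name, field names, and the `combined_loss` helper are hypothetical, and the assumption that the auxiliary losses are added to a base training loss as a plain weighted sum is ours; the summary above only reports the two weights.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class XVerseReportedSetup:
    """Values quoted in the Experiment Setup row; all field names are hypothetical."""
    lora_rank: int = 128
    learning_rate: float = 5e-6
    # Iterations for the three training stages, in the order reported.
    stage_iterations: Tuple[int, int, int] = (70_000, 150_000, 10_000)
    region_preservation_weight: float = 10.0
    text_image_attention_weight: float = 0.01

    @property
    def total_iterations(self) -> int:
        # Sum over all three stages.
        return sum(self.stage_iterations)


def combined_loss(base_loss: float,
                  region_loss: float,
                  attention_loss: float,
                  setup: XVerseReportedSetup = XVerseReportedSetup()) -> float:
    # ASSUMPTION: the auxiliary terms enter as a weighted sum on top of a base
    # loss. The source reports only the weights, not the combination rule.
    return (base_loss
            + setup.region_preservation_weight * region_loss
            + setup.text_image_attention_weight * attention_loss)


if __name__ == "__main__":
    setup = XVerseReportedSetup()
    print(setup.total_iterations)
    print(combined_loss(1.0, 0.1, 2.0, setup))
```

Keeping the reported values in one frozen object makes it easy to see which settings come from the paper and which (software versions, dataset splits) a replication would still have to choose on its own.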