OneActor: Consistent Subject Generation via Cluster-Conditioned Guidance

Authors: Jiahao Wang, Caixia Yan, Haonan Lin, Weizhan Zhang, Mengmeng Wang, Tieliang Gong, Guang Dai, Hao Sun

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Comprehensive experiments show that our method outperforms a variety of baselines with satisfactory subject consistency, superior prompt conformity as well as high image quality.
Researcher Affiliation | Collaboration | (1) School of Computer Science and Technology, MOEKLINNS, Xi'an Jiaotong University; (2) State Key Laboratory of Communication Content Cognition; (3) College of Computer Science and Technology, Zhejiang University of Technology; (4) SGIT AI Lab, State Grid Corporation of China; (5) China Telecom Artificial Intelligence Technology Co., Ltd.
Pseudocode | Yes | Appendix G (Pseudo-Code), Algorithm 1: Tuning Process of OneActor
Open Source Code | Yes | We submitted codes in the supplementary materials. We will release the codes officially online with detailed instructions after preparations.
Open Datasets | No | The paper trains on data generated from a pre-trained diffusion model and augmented target/auxiliary samples, rather than a publicly available dataset with specific access information.
Dataset Splits | No | The paper does not explicitly provide training/test/validation dataset splits with specific percentages or counts. The tuning process uses generated data and a convergence criterion.
Hardware Specification | Yes | All experiments are finished on a single NVIDIA A100 GPU.
Software Dependencies | No | The paper lists the software implementations and models used (e.g., SDXL, DreamBooth-LoRA, CLIP), but does not provide specific version numbers for general software dependencies such as Python, PyTorch, or CUDA.
Experiment Setup | Yes | All images are generated in 30 steps. During tuning, we generate N = 11 base images and use K = 3 auxiliary images per batch. The projector consists of a 5-layer ResNet, 1 linear layer, and 1 AdaIN layer. The weight hyper-parameters λ1 and λ2 are set to 0.5 and 0.2. We use the default AdamW optimizer with a learning rate of 0.0001 and a weight decay of 0.01. We tune with a convergence criterion, which takes 3-6 minutes in most cases. During inference, we set the semantic interpolation scale v to 0.8 if not specified, and the cluster guidance scales η1 and η2 to 8.5 and 1.0. We apply cluster guidance to the first 20 inference steps and normal guidance to the last 10 steps.
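As a reading aid, below is a minimal sketch that collects the reported setup values into a single configuration object plus a helper for the step-wise guidance schedule. It uses only the numbers quoted in the Experiment Setup row; the class and function names are illustrative and not taken from the authors' released code, and treating the last 10 "normal guidance" steps as keeping η1 while dropping the extra cluster term (η2 = 0) is our assumption, since the paper only states that cluster guidance is applied to the first 20 of 30 steps.

```python
# Hedged sketch of the reported OneActor experiment setup.
# All numeric values come from the paper's setup description; the names
# OneActorConfig and guidance_scales_for_step are hypothetical.
from dataclasses import dataclass


@dataclass
class OneActorConfig:
    # Sampling
    num_inference_steps: int = 30      # "All images are generated in 30 steps."
    cluster_guidance_steps: int = 20   # cluster guidance on the first 20 steps
    # Tuning data
    num_base_images: int = 11          # N = 11 base images
    aux_images_per_batch: int = 3      # K = 3 auxiliary images per batch
    # Loss weights
    lambda_1: float = 0.5
    lambda_2: float = 0.2
    # Optimizer (default AdamW)
    learning_rate: float = 1e-4
    weight_decay: float = 0.01
    # Inference-time scales
    semantic_interpolation_v: float = 0.8
    cluster_guidance_eta_1: float = 8.5
    cluster_guidance_eta_2: float = 1.0


def guidance_scales_for_step(step: int, cfg: OneActorConfig) -> tuple[float, float]:
    """Return the (eta_1, eta_2) pair used at a given denoising step.

    Assumption: the "normal guidance" used for the last 10 steps is modeled
    here as keeping eta_1 and disabling the extra cluster term (eta_2 = 0).
    The paper only specifies the 20/10 split, not the exact fallback form.
    """
    if step < cfg.cluster_guidance_steps:
        return cfg.cluster_guidance_eta_1, cfg.cluster_guidance_eta_2
    return cfg.cluster_guidance_eta_1, 0.0


if __name__ == "__main__":
    cfg = OneActorConfig()
    # Steps 0 and 19 use cluster guidance (8.5, 1.0); steps 20 and 29 fall back.
    print([guidance_scales_for_step(s, cfg) for s in (0, 19, 20, 29)])
```

Running the file prints the (η1, η2) pair at representative steps, which makes the 20/10 split of the 30-step schedule explicit.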