Ctrl-X: Controlling Structure and Appearance for Text-To-Image Generation Without Guidance

Authors: Kuan Heng Lin, Sicheng Mo, Ben Klingher, Fangzhou Mu, Bolei Zhou

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive qualitative and quantitative experiments illustrate the superior performance of Ctrl-X on various condition inputs and model checkpoints.
Researcher Affiliation | Collaboration | Kuan Heng Lin¹*, Sicheng Mo¹*, Ben Klingher¹, Fangzhou Mu², Bolei Zhou¹; ¹University of California, Los Angeles; ²NVIDIA
Pseudocode | No | The paper describes the method using prose and diagrams (Figure 3), but it does not include any formal pseudocode blocks or algorithm listings.
Open Source Code | Yes | We publicly release our code and our data (for quantitative evaluation) at https://github.com/genforce/ctrl-x.
Open Datasets | Yes | We publicly release our dataset in our code release: https://github.com/genforce/ctrl-x. Our dataset consists of 177 1024 × 1024 images divided into 16 types and across 7 categories.
Dataset Splits | No | The paper mentions creating a new dataset for evaluation ('256 diverse structure-appearance pairs') and describes its composition. However, it does not specify explicit training, validation, or test dataset splits for model evaluation. It only mentions selecting '15 sample pairs' for a user study, which is not a general dataset split for the main quantitative evaluations.
Hardware Specification | Yes | We implement Ctrl-X with Diffusers [37] and run all experiments on a single NVIDIA A6000 GPU, except evaluating inference efficiency in Table 1 where we run on a single NVIDIA H100 GPU.
Software Dependencies | No | The paper states, 'We implement Ctrl-X with Diffusers [37]', citing the Diffusers library. However, it does not specify a particular version number for Diffusers or any other software dependency, which is required for reproducibility.
Experiment Setup | Yes | For SDXL, we set $L^{\text{feat}} = \{0\}_{\text{decoder}}$, $L^{\text{self}} = \{0, 1, 2\}_{\text{decoder}}$, $L^{\text{app}} = \{1, 2, 3, 4\}_{\text{decoder}} \cup \{2, 3, 4, 5\}_{\text{encoder}}$, and $\tau^s = \tau^a = 0.6$. We sample $I^o$ with 50 steps of DDIM sampling and set $\eta = 1$ [33], doing self-recurrence for $n^r = 2$ for $\tau^r_0 = 0.1$ and $\tau^r_1 = 0.5$.
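
The SDXL settings quoted in the Experiment Setup row can be collected into a small configuration object, which makes it easier to see how many of the 50 sampling steps each schedule would cover. The following is a minimal sketch, not the authors' code: the class name, field names, and both helper functions are hypothetical. It also assumes that each τ value is a normalized fraction of the sampling steps, that control is applied during the first τ fraction of steps, and that self-recurrence is applied for steps whose normalized index lies in [0.1, 0.5). The released repository should be consulted for the actual behavior.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CtrlXSDXLConfig:
    """Values quoted in the Experiment Setup row; all field names are hypothetical."""
    feat_layers_decoder: tuple = (0,)           # feature injection layers
    self_attn_layers_decoder: tuple = (0, 1, 2) # self-attention injection layers
    app_layers_decoder: tuple = (1, 2, 3, 4)    # appearance transfer, decoder
    app_layers_encoder: tuple = (2, 3, 4, 5)    # appearance transfer, encoder
    tau_structure: float = 0.6
    tau_appearance: float = 0.6
    num_steps: int = 50                         # DDIM sampling steps
    eta: float = 1.0
    n_recurrence: int = 2
    tau_recur_start: float = 0.1
    tau_recur_end: float = 0.5


def steps_with_control(num_steps: int, tau: float) -> list:
    """ASSUMPTION: control is active for the first `tau` fraction of steps."""
    cutoff = round(num_steps * tau)
    return list(range(cutoff))


def steps_with_recurrence(num_steps: int, start: float, end: float) -> list:
    """ASSUMPTION: recurrence applies where start <= i / num_steps < end."""
    return [i for i in range(num_steps) if start <= i / num_steps < end]


cfg = CtrlXSDXLConfig()
control_steps = steps_with_control(cfg.num_steps, cfg.tau_structure)
recur_steps = steps_with_recurrence(
    cfg.num_steps, cfg.tau_recur_start, cfg.tau_recur_end
)
print(len(control_steps), len(recur_steps))
```

Under these assumptions, structure and appearance control would each cover 30 of the 50 steps, and self-recurrence would cover step indices 5 through 24.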