Controlling Text-to-Image Diffusion by Orthogonal Finetuning

Authors: Zeju Qiu, Weiyang Liu, Haiwen Feng, Yuxuan Xue, Yao Feng, Zhen Liu, Dan Zhang, Adrian Weller, Bernhard Schölkopf

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We empirically show that our OFT framework outperforms existing methods in generation quality and convergence speed."
Researcher Affiliation | Collaboration | (1) MPI for Intelligent Systems, Tübingen; (2) University of Cambridge; (3) University of Tübingen; (4) Mila, Université de Montréal; (5) Bosch Center for Artificial Intelligence; (6) The Alan Turing Institute
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. (A hypothetical sketch of the method is given after the table.)
Open Source Code | No | The paper refers to code repositories for baseline implementations (e.g., Diffusers, ControlNet, BLIP, MiDaS) but does not provide a direct link or an explicit statement about releasing the source code for its own method (OFT).
Open Datasets | Yes | "For training the convolutional autoencoder from Figure 2, we use 1000 random images from the Oxford 102 Flower dataset [40]. For the task of subject-driven generation, we use the official DreamBooth dataset... For the C2I task, we use the whole COCO 2017 dataset [31]... For the S2I task, we use the semantic segmentation dataset ADE20K [70]... For the L2I task, we use the CelebA-HQ dataset [25]... For the P2I task, we use the DeepFashion-MultiModal dataset [24]... For the Sk2I task, we use a subset of the LAION-Aesthetics dataset..."
Dataset Splits | No | The paper mentions using 'validation CLIP metrics' to select models and evaluating on 'validation images' in places, but it does not give specific percentages or counts for training/validation/test splits, nor does it describe a cross-validation setup.
Hardware Specification | Yes | "We perform training on 1 Tesla V100-SXM2-32GB GPU using a learning rate of 6 × 10−5, batch size of 1, and train for approximately 1000 iterations. ... We perform training on 4 NVIDIA A100-SXM4-80GB GPUs using a learning rate of 1 × 10−5, batch size of 4 for L2I and batch size of 16 for the rest of tasks."
Software Dependencies | No | The paper mentions software components such as Diffusers, ControlNet, BLIP, MiDaS, SegFormer, and pytorch-fid, but it does not provide specific version numbers for these dependencies.
Experiment Setup | Yes | "We perform training on 1 Tesla V100-SXM2-32GB GPU using a learning rate of 6 × 10−5, batch size of 1, and train for approximately 1000 iterations. In the case of COFT, we use ϵ = 6 × 10−5 to constrain the orthogonal matrices. ... We perform training on 4 NVIDIA A100-SXM4-80GB GPUs using a learning rate of 1 × 10−5, batch size of 4 for L2I and batch size of 16 for the rest of tasks. For fine-tuning with COFT, we use ϵ = 1 × 10−3. ... For S2I, L2I and P2I, we fine-tune the model for 20 epochs; for C2I and D2I we fine-tune the model for 10 epochs; for Sk2I we fine-tune the model for 8 epochs." (These settings are restated as a config sketch below.)
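
Because the paper contains no algorithm block, the following is a minimal sketch of the core OFT idea for readers of this report: a frozen pretrained weight matrix W0 is reparameterized as R·W0, where R is an orthogonal matrix produced from a trainable skew-symmetric generator via the Cayley transform, and COFT additionally constrains the generator to an ε-ball so that R stays near the identity. This is an illustrative reimplementation, not the authors' released code; the class name `OFTLinear` and the exact projection scheme are our assumptions.

```python
import torch
import torch.nn as nn


class OFTLinear(nn.Module):
    """Hypothetical OFT layer: the frozen pretrained weight W0 is
    left-multiplied by a learned orthogonal matrix R, so only a
    rotation of the pretrained neurons is finetuned."""

    def __init__(self, linear: nn.Linear, eps: float = 6e-5, coft: bool = False):
        super().__init__()
        # Freeze the pretrained weights; only Q below is trained.
        self.register_buffer("weight", linear.weight.detach().clone())  # W0
        self.register_buffer(
            "bias", linear.bias.detach().clone() if linear.bias is not None else None
        )
        d = linear.out_features
        # Skew-symmetric generator; Q = 0 gives R = I, i.e. the
        # pretrained model is reproduced exactly at initialization.
        self.Q = nn.Parameter(torch.zeros(d, d))
        self.eps, self.coft = eps, coft

    def _cayley(self, Q: torch.Tensor) -> torch.Tensor:
        S = 0.5 * (Q - Q.t())  # enforce skew-symmetry: S^T = -S
        I = torch.eye(S.size(0), device=S.device, dtype=S.dtype)
        # Cayley transform: R = (I - S)^{-1}(I + S) is orthogonal
        # for any skew-symmetric S, and differentiable in S.
        return torch.linalg.solve(I - S, I + S)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if self.coft:
            # COFT: project the generator into an eps-ball so R stays
            # near the identity (the exact projection is an assumption).
            with torch.no_grad():
                norm = self.Q.norm()
                if norm > self.eps:
                    self.Q.mul_(self.eps / norm)
        R = self._cayley(self.Q)
        return nn.functional.linear(x, R @ self.weight, self.bias)
```

In the paper, R is additionally given a block-diagonal structure (several smaller orthogonal blocks) to reduce the number of trainable parameters; the sketch above omits that for brevity.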
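
The reported hyperparameters can also be collected into a single hypothetical config for quick reference; the dictionary layout and key names below are ours, while the values are taken from the paper's setup quoted above.

```python
# Hyperparameters as reported in the paper; the structure of this
# config dict is hypothetical, only the values come from the text.
OFT_CONFIGS = {
    "subject_driven": {            # DreamBooth-style subject-driven generation
        "gpus": "1x Tesla V100-SXM2-32GB",
        "learning_rate": 6e-5,
        "batch_size": 1,
        "iterations": 1000,        # approximately
        "coft_eps": 6e-5,
    },
    "controllable": {              # ControlNet-style controllable generation
        "gpus": "4x NVIDIA A100-SXM4-80GB",
        "learning_rate": 1e-5,
        "batch_size": {"L2I": 4, "default": 16},
        "coft_eps": 1e-3,
        "epochs": {"S2I": 20, "L2I": 20, "P2I": 20,
                   "C2I": 10, "D2I": 10, "Sk2I": 8},
    },
}
```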