Controlling Text-to-Image Diffusion by Orthogonal Finetuning
Authors: Zeju Qiu, Weiyang Liu, Haiwen Feng, Yuxuan Xue, Yao Feng, Zhen Liu, Dan Zhang, Adrian Weller, Bernhard Schölkopf
NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We empirically show that our OFT framework outperforms existing methods in generation quality and convergence speed. |
| Researcher Affiliation | Collaboration | MPI for Intelligent Systems, Tübingen; University of Cambridge; University of Tübingen; Mila, Université de Montréal; Bosch Center for Artificial Intelligence; The Alan Turing Institute |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper refers to code repositories for baseline implementations (e.g., Diffusers, ControlNet, BLIP, MiDaS) but does not provide a direct link or explicit statement about releasing the source code for their own method (OFT). |
| Open Datasets | Yes | For training the convolutional autoencoder from Figure 2, we use 1000 random images from the Oxford 102 Flower dataset [40]. For the task of subject-driven generation, we use the official DreamBooth dataset... For the C2I task, we use the whole COCO 2017 dataset [31]... For the S2I task, we use the semantic segmentation dataset ADE20K [70]... For the L2I task, we use the CelebA-HQ dataset [25]... For the P2I task, we use the DeepFashion-MultiModal dataset [24]... For the Sk2I task, we use a subset of the LAION-Aesthetics dataset... |
| Dataset Splits | No | The paper mentions using 'validation CLIP metrics' to select models and discusses evaluation on 'validation images' for some parts, but it does not provide specific percentages or counts for training/validation/test splits, nor does it detail a cross-validation setup. |
| Hardware Specification | Yes | We perform training on 1 Tesla V100-SXM2-32GB GPU using a learning rate of 6 × 10⁻⁵, batch size of 1, and train for approximately 1000 iterations. ... We perform training on 4 NVIDIA A100-SXM4-80GB GPUs using a learning rate of 1 × 10⁻⁵, batch size of 4 for L2I and batch size of 16 for the rest of tasks. |
| Software Dependencies | No | The paper mentions software components such as Diffusers, ControlNet, BLIP, MiDaS, SegFormer, and pytorch-fid, but it does not provide specific version numbers for these software dependencies. |
| Experiment Setup | Yes | We perform training on 1 Tesla V100-SXM2-32GB GPU using a learning rate of 6 × 10⁻⁵, batch size of 1, and train for approximately 1000 iterations. In the case of COFT, we use ϵ = 6 × 10⁻⁵ to constrain the orthogonal matrices. ... We perform training on 4 NVIDIA A100-SXM4-80GB GPUs using a learning rate of 1 × 10⁻⁵, batch size of 4 for L2I and batch size of 16 for the rest of tasks. For fine-tuning with COFT, we use ϵ = 1 × 10⁻³. ... For S2I, L2I and P2I, we fine-tune the model for 20 epochs; for C2I and D2I we fine-tune the model for 10 epochs; for Sk2I we fine-tune the model for 8 epochs. |
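
As a quick reproduction aid, the sketch below collects the hyperparameters quoted in the Hardware Specification and Experiment Setup rows into a single Python configuration. The variable names and dictionary layout are illustrative assumptions; only the numeric values come from the quotes above.

```python
# Hypothetical summary of the two training configurations quoted in the table.
# Only the numeric values (learning rates, batch sizes, epsilon, iterations,
# epochs) come from the paper; the key names are illustrative.

SUBJECT_DRIVEN_CONFIG = {           # DreamBooth-style subject-driven generation
    "gpus": "1x Tesla V100-SXM2-32GB",
    "learning_rate": 6e-5,
    "batch_size": 1,
    "iterations": 1000,             # "approximately 1000 iterations"
    "coft_epsilon": 6e-5,           # COFT constraint on the orthogonal matrices
}

CONTROLLABLE_GENERATION_CONFIG = {  # ControlNet-style controllable generation tasks
    "gpus": "4x NVIDIA A100-SXM4-80GB",
    "learning_rate": 1e-5,
    "batch_size": {"L2I": 4, "default": 16},
    "coft_epsilon": 1e-3,
    "epochs": {"S2I": 20, "L2I": 20, "P2I": 20, "C2I": 10, "D2I": 10, "Sk2I": 8},
}
```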
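The ϵ values above refer to COFT's constraint on the orthogonal matrices. Since the paper ships no pseudocode (see the Pseudocode row), the following is a minimal, hedged sketch of one orthogonally fine-tuned linear layer. It assumes a Cayley parameterization of the orthogonal matrix and a simple Frobenius-norm clamp as the ϵ projection, and it omits the block-diagonal structure the paper uses for efficiency; class and method names are hypothetical.

```python
# Minimal sketch of one orthogonally fine-tuned (OFT) linear layer, assuming a
# Cayley parameterization and a Frobenius-norm clamp for the COFT epsilon
# constraint. This is an illustration, not the authors' implementation.
from typing import Optional

import torch
import torch.nn as nn
import torch.nn.functional as F


class OFTLinear(nn.Module):
    def __init__(self, pretrained: nn.Linear, eps: Optional[float] = None):
        super().__init__()
        self.pretrained = pretrained                 # frozen pretrained layer
        for p in self.pretrained.parameters():
            p.requires_grad_(False)
        d = pretrained.out_features
        # Trainable generator, initialized to zero so R starts as the identity
        # and the fine-tuned model starts exactly at the pretrained weights.
        self.generator = nn.Parameter(torch.zeros(d, d))
        self.eps = eps                               # COFT radius; None = plain OFT

    def _orthogonal(self) -> torch.Tensor:
        q = self.generator - self.generator.T        # skew-symmetric part
        if self.eps is not None:
            norm = q.norm()                          # assumed projection: clamp the
            if norm > self.eps:                      # generator to an eps-ball so R
                q = q * (self.eps / norm)            # stays near the identity
        eye = torch.eye(q.shape[0], device=q.device, dtype=q.dtype)
        # Cayley transform: (I + Q)(I - Q)^-1 is orthogonal for skew-symmetric Q.
        return (eye + q) @ torch.linalg.inv(eye - q)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        r = self._orthogonal()
        w = r @ self.pretrained.weight               # rotate the frozen weight rows
        return F.linear(x, w, self.pretrained.bias)
```

Under this reading, `eps=None` recovers unconstrained OFT, while a small `eps` (such as the 6 × 10⁻⁵ quoted above) keeps the fine-tuned weights in a tight neighborhood of the pretrained ones.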