Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
FlexEControl: Flexible and Efficient Multimodal Control for Text-to-Image Generation
Authors: Xuehai He, Jian Zheng, Jacob Zhiyuan Fang, Robinson Piramuthu, Mohit Bansal, Vicente Ordonez, Gunnar A Sigurdsson, Nanyun Peng, Xin Eric Wang
TMLR 2024 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In the realm of text-to-image (T2I) generation, diffusion models exhibit exceptional performance in transforming textual descriptions into visually accurate images. Such models exhibit extraordinary potential across a plethora of applications, spanning from content creation (Rombach et al., 2022; Saharia et al., 2022b; Nichol et al., 2021; Ramesh et al., 2021a; Yu et al., 2022; Avrahami et al., 2023; Chang et al., 2023), image editing (Balaji et al., 2022; Kawar et al., 2023; Couairon et al., 2022; Zhang et al., 2023; Valevski et al., 2022; Nichol et al., 2021; Hertz et al., 2022; Brooks et al., 2023; Mokady et al., 2023), and also fashion design (Cao et al., 2023). We propose a new unified method that can tackle two problems in text-to-image generation: improve the training efficiency of T2I models concerning memory usage, computational requirements, and a thirst for extensive datasets (Saharia et al., 2022a; Rombach et al., 2022; Ramesh et al., 2021b); and improve their controllability especially when dealing with multimodal conditioning, e.g. multiple edge maps and at the same time follow the guidance of text prompts, as shown in Figure 1 (c). |
| Researcher Affiliation | -1 | The paper states "Anonymous authors Paper under double-blind review". Therefore, no affiliation information is available to classify the author affiliation types. |
| Pseudocode | No | The paper describes the methodology using textual explanations and mathematical equations for loss functions (L_ldm, L_ca, L_mask, L_total) and architectural components (Kronecker decomposition, low-rank decomposition), but it does not include any clearly labeled pseudocode or algorithm blocks with structured steps. |
| Open Source Code | No | The paper mentions using and comparing against existing models and frameworks like "Uni-ControlNet (Zhao et al., 2023)", "Stable Diffusion 1.5", and "Stable Diffusion 2.1 model" available at Hugging Face. However, there is no explicit statement from the authors about releasing the source code for their proposed method, FlexEControl, nor is there a link to a code repository or a mention of code in supplementary materials. |
| Open Datasets | Yes | In pursuit of our objective of achieving controlled Text-to-Image (T2I) generation, we employed the LAION improved_aesthetics_6plus (Schuhmann et al., 2022) dataset for our model training. |
| Dataset Splits | No | The paper mentions curating a "subset comprising 5,082,236 instances" from the LAION dataset for training, and using a "subset of 100,000 training samples" for ablation studies. For quantitative assessment, "a subset comprising 10,000 high-quality images from the LAION improved_aesthetics_6.5plus dataset was utilized." While it specifies dataset sizes and subsets used for different purposes, it does not explicitly provide details about training/validation/test splits, such as percentages or specific partitioning methodology, for their experiments. |
| Hardware Specification | Yes | The fine-tuning process was executed on P3 AWS EC2 instances equipped with 64 NVIDIA V100 GPUs. |
| Software Dependencies | No | The paper mentions using "Stable Diffusion 1.5", "AdamW optimizer (Kingma and Ba, 2014)", "gpt-3.5-turbo", "CLIPSeg (Lüddecke and Ecker, 2022) (clipseg-rd64-refined version)", and "Stable Diffusion 2.1 model". While these are specific tools or models, the paper does not list key general software dependencies with their version numbers, such as programming languages (e.g., Python), deep learning frameworks (e.g., PyTorch, TensorFlow), or GPU-specific libraries (e.g., CUDA), which are crucial for full reproducibility. |
| Experiment Setup | Yes | Our model underwent training for a single epoch, employing the AdamW optimizer (Kingma and Ba, 2014) with a learning rate set at 10^-5. Throughout all experimental iterations, we standardized the dimensions of input and conditional images to 512 × 512. We also study the impact of λ_ca and λ_mask trained on the subset of 100,000 samples from LAION improved_aesthetics_6plus for 6,000 steps. We evaluated the score on SSIM of canny edge maps and mIoU of segmentation maps; results are shown in Figure 4. As observed, FlexEControl achieves optimal performance when both λ_ca = 0.01 and λ_mask = 0.01. |
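The Pseudocode row notes that the paper's parameter-efficient scheme combines a Kronecker decomposition with a low-rank decomposition but provides no algorithm block. A minimal sketch of that general idea (not the authors' implementation; all names here are hypothetical) composes a weight matrix as a Kronecker product of two small factors plus a low-rank correction, storing far fewer parameters than the dense matrix:

```python
import numpy as np

def kron_lowrank_weight(A, B, U, V):
    """Compose W = A ⊗ B + U @ V: a Kronecker product of two small
    factors plus a low-rank correction. Illustrative sketch only."""
    return np.kron(A, B) + U @ V

rng = np.random.default_rng(0)
# Target: a 16x16 weight stored via far fewer parameters.
A = rng.standard_normal((4, 4))   # Kronecker factor A
B = rng.standard_normal((4, 4))   # Kronecker factor B
U = rng.standard_normal((16, 2))  # low-rank factors, rank 2
V = rng.standard_normal((2, 16))

W = kron_lowrank_weight(A, B, U, V)
dense_params = W.size                           # 256 if stored densely
stored_params = A.size + B.size + U.size + V.size  # 16 + 16 + 32 + 32 = 96
print(W.shape, stored_params, dense_params)
```

Here the decomposed form holds 96 parameters instead of 256; at the scale of diffusion-model weight matrices this gap is what drives the memory savings the paper claims.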