Energy-Based Cross Attention for Bayesian Context Update in Text-to-Image Diffusion Models

Authors: Geon Yeong Park, Jeongsol Kim, Beomsu Kim, Sang Wan Lee, Jong Chul Ye

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Using extensive experiments, we demonstrate that the proposed method is highly effective in handling various image generation tasks, including multi-concept generation, text-guided image inpainting, and real and synthetic image editing.
Researcher Affiliation | Academia | 1Bio and Brain Engineering, 2Kim Jaechul Graduate School of AI, 3Brain and Cognitive Sciences, Korea Advanced Institute of Science and Technology (KAIST), Daejeon, Korea
Pseudocode | Yes | Algorithm 1: Context cascade at sampling step t; Algorithm 2: Energy-based Bayesian Context Update (EBCU); Algorithm 3: Energy-based Composition of Queries (EBCQ)
Open Source Code | Yes | Code: https://github.com/EnergyAttention/Energy-Based-CrossAttention
Open Datasets | Yes | The SD is an LDM pre-trained on a subset of the large-scale image-language pair dataset LAION-5B [34], followed by fine-tuning on the LAION-aesthetic dataset.
Dataset Splits | No | The paper mentions training and testing but does not explicitly specify validation dataset splits (percentages or counts) or reference predefined validation splits.
Hardware Specification | Yes | All images are sampled for 50 steps via the PNDM sampler [21] using an NVIDIA RTX 2080Ti.
Software Dependencies | No | The paper mentions 'diffusers, a Python library' and a pre-trained CLIP [29] model but does not provide version numbers for Python, diffusers, or any other software dependencies.
Experiment Setup | Yes | In every experiment, we set the parameter α in Equation (9) to zero, focusing solely on controlling the values of γattn and γreg. EBCU is applied to every task, and EBCQ is additionally employed in C.4. For the proposed method, we set γattn and γreg differently for each sample, chosen from {1e-2, 1.5e-2, 2e-2}. After inverting real images into initial noise vectors, or using a fixed random seed for synthetic images, EBCU and EBCQ are applied after a threshold time τs > 0 for the s-th editorial context. This scheduling strategy helps to preserve the overall structure of generated images during the editing process. In our observations, a value of τs ∈ [10, 25] generally produces satisfactory results, given a total of 50 reverse steps.
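The threshold scheduling described in the setup can be sketched as follows. This is a minimal, pure-Python illustration, not the authors' implementation: the quadratic "energy" and all function names here are assumptions standing in for the paper's cross-attention energy, and the point is only the control flow — the context is left untouched for the first τs reverse steps and updated by gradient descent afterwards, with γreg pulling it back toward its original value.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def energy(query, context, context0, gamma_attn, gamma_reg):
    # Toy energy: an attention-alignment term plus a quadratic regularizer
    # toward the original context (stand-in for the paper's exact form).
    attn = -gamma_attn * dot(query, context)
    reg = gamma_reg * sum((c - c0) ** 2 for c, c0 in zip(context, context0))
    return attn + reg

def scheduled_context_update(query, context, tau=10, num_steps=50,
                             lr=0.1, gamma_attn=1e-2, gamma_reg=1e-2):
    """Apply a toy EBCU-style update only after step `tau` (threshold scheduling)."""
    context0 = list(context)  # remember the original context for regularization
    ctx = list(context)
    for step in range(num_steps):
        if step < tau:
            continue  # early steps: no update, preserving overall image structure
        # Gradient of the toy energy with respect to the context vector.
        grad = [-gamma_attn * q + 2.0 * gamma_reg * (c - c0)
                for q, c, c0 in zip(query, ctx, context0)]
        ctx = [c - lr * g for c, g in zip(ctx, grad)]
    return ctx
```

With τs = 50 and 50 reverse steps the update never fires and the context is returned unchanged; with τs in [10, 25] it is refined over the remaining steps, which is the trade-off the setup describes.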