Energy-Based Cross Attention for Bayesian Context Update in Text-to-Image Diffusion Models
Authors: Geon Yeong Park, Jeongsol Kim, Beomsu Kim, Sang Wan Lee, Jong Chul Ye
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Using extensive experiments, we demonstrate that the proposed method is highly effective in handling various image generation tasks, including multi-concept generation, text-guided image inpainting, and real and synthetic image editing. |
| Researcher Affiliation | Academia | 1Bio and Brain Engineering, 2Kim Jaechul Graduate School of AI, 3Brain and Cognitive Sciences Korea Advanced Institute of Science and Technology (KAIST), Daejeon, Korea |
| Pseudocode | Yes | Algorithm 1 Context cascade at sampling step t; Algorithm 2 Energy-based Bayesian Context Update (EBCU); Algorithm 3 Energy-based Composition of Queries (EBCQ) |
| Open Source Code | Yes | Code: https://github.com/EnergyAttention/Energy-Based-CrossAttention |
| Open Datasets | Yes | The SD is an LDM that is pre-trained on a subset of the large-scale image-language pair dataset, LAION-5B [34], followed by fine-tuning on the LAION-aesthetic dataset. |
| Dataset Splits | No | The paper mentions training and testing but does not explicitly specify validation dataset splits (percentages or counts) or reference predefined validation splits. |
| Hardware Specification | Yes | All images are sampled for 50 steps via PNDM sampler [21] using NVIDIA RTX 2080Ti. |
| Software Dependencies | No | The paper mentions 'diffusers, a Python library' and 'pre-trained CLIP [29] model' but does not provide specific version numbers for Python, diffusers, or any other software dependencies. |
| Experiment Setup | Yes | In every experiment, we set the parameter α in Equation (9) to zero, focusing solely on controlling the values of γattn and γreg. EBCU is applied to every task, and EBCQ is additionally employed in C.4. For the proposed method, we set γattn and γreg differently for each sample within [1e-2, 1.5e-2, 2e-2]. After converting the initial noise vector for real images or using a fixed random seed for synthetic images, EBCU and EBCQ are applied after a threshold time τs > 0 for the s-th editorial context. This scheduling strategy helps to preserve the overall structure of generated images during the editing process. In our observations, a value of τs ∈ [10, 25] generally produces satisfactory results, given a total of 50 reverse steps. |
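The threshold scheduling described in the experiment-setup row can be sketched as follows. This is an illustrative reconstruction, not the authors' code: the function name `edit_schedule` and the per-context thresholds are hypothetical placeholders, and only the scheduling logic (each editorial context s activates once τs of the 50 reverse steps have elapsed) follows the quoted description.

```python
def edit_schedule(total_steps=50, thresholds=(15, 20)):
    """Illustrative sketch of the paper's threshold scheduling.

    Returns, for each reverse diffusion step t (counting down from
    total_steps to 1), the list of editorial-context indices s whose
    update (EBCU, and optionally EBCQ) is active. Context s is held
    inactive until tau_s steps have elapsed, so the early reverse
    steps fix the overall image structure before editing begins.
    """
    schedule = []
    for t in range(total_steps, 0, -1):
        elapsed = total_steps - t  # reverse steps already taken
        active = [s for s, tau_s in enumerate(thresholds) if elapsed >= tau_s]
        schedule.append((t, active))
    return schedule

sched = edit_schedule()
print(sched[0])   # (50, []) -- no context active at the first reverse step
print(sched[-1])  # (1, [0, 1]) -- both contexts active by the last step
```

With the paper's recommended τs ∈ [10, 25] and 50 total reverse steps, each context is thus inactive for the first 10-25 reverse steps and applied for the remainder.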