GrounDiT: Grounding Diffusion Transformers via Noisy Patch Transplantation

Authors: Yuseung Lee, Taehoon Yoon, Minhyuk Sung

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In our experiments on the HRS and DrawBench benchmarks, we achieve state-of-the-art performance compared to previous training-free approaches.
Researcher Affiliation | Academia | Phillip Y. Lee, Taehoon Yoon, Minhyuk Sung, KAIST, {phillip0701,taehoon,mhsung}@kaist.ac.kr
Pseudocode | Yes | Algorithm 1: Pseudocode of Joint Denoising (Sec. 5.2).
Open Source Code | No | Justification: While we have not provided the code yet, we provide a link to our project page, where we will release the official code for our method.
Open Datasets | Yes | In our experiments on the HRS [3] and DrawBench [43] datasets, we evaluate our framework, GrounDiT, using PixArt-α [8] as the base text-to-image DiT model. (See the loading sketch below the table.)
Dataset Splits | No | The HRS dataset consists of 1002, 501, and 501 images for each respective criterion, with bounding boxes generated using GPT-4 by Phung et al. [38]. No explicit train/validation/test splits are provided for their experimental setup.
Hardware Specification | Yes | All our experiments were conducted on an NVIDIA RTX 3090 GPU.
Software Dependencies | No | The paper mentions using PixArt-α [8] as the base text-to-image DiT model and the DPM-Solver scheduler [35], but does not provide specific software dependency versions (e.g., Python, PyTorch, CUDA versions).
Experiment Setup | Yes | For sampling, we employed the DPM-Solver scheduler [35] with 50 steps. Out of the 50 denoising steps, we applied our GrounDiT denoising step (Alg. 2) for the initial 25 steps and the vanilla denoising step for the remaining 25 steps. For the grounding loss in the Global Update of GrounDiT, we adopted the definition proposed in R&B [47]; we set the loss scale to 10 and used a gradient descent weight of 5 for the gradient descent update in Eq. 7. (See the schedule sketch below the table.)
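
For context on the Open Datasets and Experiment Setup rows, the base model and scheduler correspond to a standard Hugging Face diffusers setup. The following is a minimal loading sketch, assuming the PixArtAlphaPipeline and the public PixArt-alpha/PixArt-XL-2-1024-MS checkpoint; the checkpoint id and prompt are illustrative assumptions, and the snippet covers only vanilla sampling, not the GrounDiT grounding procedure.

```python
import torch
from diffusers import PixArtAlphaPipeline, DPMSolverMultistepScheduler

# Load a PixArt-α base model; the exact checkpoint id is an assumption,
# since the paper only names PixArt-α as the base text-to-image DiT.
pipe = PixArtAlphaPipeline.from_pretrained(
    "PixArt-alpha/PixArt-XL-2-1024-MS", torch_dtype=torch.float16
).to("cuda")

# Swap in the DPM-Solver scheduler used for sampling (50 steps in the paper).
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)

# Vanilla (ungrounded) sampling only; the GrounDiT grounding step is not
# part of this snippet. The prompt is an example, not from HRS/DrawBench.
image = pipe("a cat sitting on a red chair", num_inference_steps=50).images[0]
image.save("sample.png")
```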
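The Experiment Setup row describes a 50-step DPM-Solver schedule in which the GrounDiT denoising step is applied for the first 25 steps and the vanilla step for the remaining 25, with a grounding-loss scale of 10 and a gradient-descent weight of 5. The sketch below illustrates only that scheduling; grounding_loss, groundit_step, and the model/cond interfaces are hypothetical placeholders, the toy loss stands in for the R&B grounding loss, and the noisy-patch transplantation of Alg. 1/2 is omitted.

```python
import torch

def grounding_loss(noise_pred, boxes):
    """Toy stand-in for the R&B grounding loss [47]: penalize predicted-noise
    energy outside the target boxes. The actual loss operates on
    cross-attention maps; this placeholder only keeps the sketch runnable."""
    _, _, h, w = noise_pred.shape
    mask = torch.zeros(h, w, device=noise_pred.device)
    for x0, y0, x1, y1 in boxes:  # boxes in normalized [0, 1] coordinates
        mask[int(y0 * h):int(y1 * h), int(x0 * w):int(x1 * w)] = 1.0
    return (noise_pred.float().pow(2) * (1.0 - mask)).mean()

def groundit_step(latents, t, scheduler, model, cond, boxes,
                  loss_scale=10.0, grad_weight=5.0):
    """One GrounDiT-style step: a 'Global Update' by gradient descent on the
    latent (Eq. 7-style), followed by a scheduler step. The joint denoising /
    noisy-patch transplantation of the paper is omitted here."""
    latents = latents.detach().requires_grad_(True)
    noise_pred = model(latents, t, **cond)
    loss = loss_scale * grounding_loss(noise_pred, boxes)
    grad = torch.autograd.grad(loss, latents)[0]
    latents = (latents - grad_weight * grad).detach()
    with torch.no_grad():
        noise_pred = model(latents, t, **cond)
    return scheduler.step(noise_pred, t, latents).prev_sample

def vanilla_step(latents, t, scheduler, model, cond):
    """Standard denoising step with no grounding update."""
    with torch.no_grad():
        noise_pred = model(latents, t, **cond)
    return scheduler.step(noise_pred, t, latents).prev_sample

def sample(latents, scheduler, model, cond, boxes,
           num_steps=50, num_groundit_steps=25):
    """50 DPM-Solver steps: GrounDiT step for the first 25, vanilla after."""
    scheduler.set_timesteps(num_steps)
    for i, t in enumerate(scheduler.timesteps):
        if i < num_groundit_steps:
            latents = groundit_step(latents, t, scheduler, model, cond, boxes)
        else:
            latents = vanilla_step(latents, t, scheduler, model, cond)
    return latents
```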