DiffUTE: Universal Text Editing Diffusion Model
Authors: Haoxing Chen, Zhuoer Xu, Zhangxuan Gu, Jun Lan, Xing Zheng, Yaohui Li, Changhua Meng, Huijia Zhu, Weiqiang Wang
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results show that our method achieves impressive performance and enables controllable editing on in-the-wild images with high fidelity. We conduct extensive experiments to evaluate the performance of DiffUTE. Our method performs favorably over prior arts for text image editing, as measured by quantitative metrics and visualization. |
| Researcher Affiliation | Collaboration | Haoxing Chen¹,², Zhuoer Xu¹, Zhangxuan Gu¹, Jun Lan¹, Xing Zheng¹, Yaohui Li², Changhua Meng¹, Huijia Zhu¹, Weiqiang Wang¹ (¹Ant Group, ²Nanjing University) |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our code will be available at https://github.com/chenhaoxing/DiffUTE. |
| Open Datasets | Yes | Due to the lack of large-scale datasets for generating text image compositions, we collect 5M images by combining web-crawled data and publicly available text image datasets, including CDLA Li, XFUND Xu et al. [2022b], PubLayNet Zhong et al. [2019] and the ICDAR series competitions Zhang et al. [2019], Nayef et al. [2019], Karatzas et al. [2015], to prepare our training dataset. |
| Dataset Splits | No | The paper describes a training dataset and a test set, but does not specify details for a validation split (e.g., percentages, sample counts, or specific pre-defined splits). |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU or CPU models, or other compute specifications used for running experiments. |
| Software Dependencies | No | The paper mentions several software components like Stable Diffusion, VAE, UNet, TrOCR, CLIP, and ChatGLM, but does not provide specific version numbers for these or other software dependencies (e.g., Python, PyTorch, CUDA). |
| Experiment Setup | Yes | The VAE is trained for three epochs with a batch size of 48 and a learning rate of 5e-6. We set the batch size to 256 and the learning rate to 1e-5, and train for five epochs. All images are cropped/resized to 512 x 512 resolution as model inputs. We propose a progressive training strategy (PTT) in which the size of the images used for training increases as training proceeds. Specifically, in the first three stages of training, we randomly crop images of sizes S/8, S/4 and S/2 and resize them to S for training, where S is the resolution of the model input image and S = H = W. Thus, the tuned VAE can learn stroke details and text recovery at different scales. In the fourth stage, we train with images of the same size as the VAE input to ensure that the VAE predicts accurately at inference time. |
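
The progressive training strategy quoted in the Experiment Setup row is concrete enough to sketch. Below is a minimal PyTorch/torchvision rendering of the crop-resize schedule; the `progressive_transform` helper, the PIL-image input, and the way `stage` is selected are illustrative assumptions, not the authors' released code — the paper only fixes the crop sizes (S/8, S/4, S/2, then S) and the resize target S.

```python
# A minimal sketch of the progressive crop-resize schedule, assuming PIL
# images and torchvision. The helper name and stage-advancement policy are
# hypothetical; only the crop sizes come from the paper.
import random
from torchvision.transforms import functional as TF

S = 512  # model input resolution; the paper sets S = H = W = 512

def progressive_transform(image, stage):
    """Crop a random stage-dependent patch and resize it back to S x S.

    Stages 0-2 crop S/8, S/4, S/2 patches so the VAE sees magnified
    stroke detail; stage 3 uses full-size inputs to match inference.
    """
    crop_size = (S // 8, S // 4, S // 2, S)[stage]
    top = random.randint(0, image.height - crop_size)
    left = random.randint(0, image.width - crop_size)
    patch = TF.crop(image, top, left, crop_size, crop_size)
    return TF.resize(patch, [S, S])
```

In use, `stage` would be advanced on a fixed schedule over the course of VAE fine-tuning, so that early phases expose the model to magnified text strokes and the final phase trains on inputs identical in size to those seen at inference.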