DiffUTE: Universal Text Editing Diffusion Model
Authors: Haoxing Chen, Zhuoer Xu, Zhangxuan Gu, Jun Lan, Xing Zheng, Yaohui Li, Changhua Meng, Huijia Zhu, Weiqiang Wang
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results show that our method achieves impressive performance and enables controllable editing on in-the-wild images with high fidelity. We conduct extensive experiments to evaluate the performance of DiffUTE. Our method performs favorably over prior arts for text image editing, as measured by quantitative metrics and visualization. |
| Researcher Affiliation | Collaboration | Haoxing Chen¹,², Zhuoer Xu¹, Zhangxuan Gu¹, Jun Lan¹, Xing Zheng¹, Yaohui Li², Changhua Meng¹, Huijia Zhu¹, Weiqiang Wang¹ (¹Ant Group, ²Nanjing University) |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our code will be available at https://github.com/chenhaoxing/DiffUTE. |
| Open Datasets | Yes | Due to the lack of large-scale datasets for generating text image compositions, we collect 5M images by combining web-crawled data and publicly available text image datasets, including CDLA Li, XFUND Xu et al. [2022b], PubLayNet Zhong et al. [2019] and the ICDAR series competitions Zhang et al. [2019], Nayef et al. [2019], Karatzas et al. [2015], to prepare our training dataset. |
| Dataset Splits | No | The paper describes a training dataset and a test set, but does not specify details for a validation split (e.g., percentages, sample counts, or specific pre-defined splits). |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU or CPU models, or other compute specifications used for running experiments. |
| Software Dependencies | No | The paper mentions several software components like Stable Diffusion, VAE, UNet, TrOCR, CLIP, and ChatGLM, but does not provide specific version numbers for these or other software dependencies (e.g., Python, PyTorch, CUDA). |
| Experiment Setup | Yes | The VAE is trained for three epochs with a batch size of 48 and a learning rate of 5e-6. We set the batch size to 256 and the learning rate to 1e-5, and train for five epochs. All images are cropped/resized to 512 x 512 resolution as model inputs. We propose a progressive training strategy (PTT) in which the size of the images used for training increases as training proceeds. Specifically, in the first three stages of training, we randomly crop images of sizes S/8, S/4 and S/2 and resize them to S for training, where S is the resolution of the model input image and S = H = W. Thus, the tuned VAE can learn stroke details and text recovery at different scales. In the fourth stage, we train with images of the same size as the VAE input to ensure that the VAE predicts accurately at inference time. |
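
The progressive training strategy quoted in the Experiment Setup row is concrete enough to sketch. Below is a minimal PyTorch/torchvision rendering of the crop-resize schedule; the `progressive_transform` helper, the PIL-image input, and the way `stage` is selected are illustrative assumptions, not the authors' released code — the paper only fixes the crop sizes (S/8, S/4, S/2, then S) and the resize target S.

```python
# A minimal sketch of the progressive crop-resize schedule, assuming PIL
# images and torchvision. The helper name and stage-advancement policy are
# hypothetical; only the crop sizes come from the paper.
import random
from torchvision.transforms import functional as TF

S = 512  # model input resolution; the paper sets S = H = W = 512

def progressive_transform(image, stage):
    """Crop a random stage-dependent patch and resize it back to S x S.

    Stages 0-2 crop S/8, S/4, S/2 patches so the VAE sees magnified
    stroke detail; stage 3 uses full-size inputs to match inference.
    """
    crop_size = (S // 8, S // 4, S // 2, S)[stage]
    top = random.randint(0, image.height - crop_size)
    left = random.randint(0, image.width - crop_size)
    patch = TF.crop(image, top, left, crop_size, crop_size)
    return TF.resize(patch, [S, S])
```

In use, `stage` would be advanced on a fixed schedule over the course of VAE fine-tuning, so that early phases expose the model to magnified text strokes and the final phase trains on inputs identical in size to those seen at inference.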