TextCtrl: Diffusion-based Scene Text Editing with Prior Guidance Control

Authors: Weichao Zeng, Yan Shu, Zhenhang Li, Dongbao Yang, Yu Zhou

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Experiments demonstrate the effectiveness of TextCtrl compared with previous methods concerning both style fidelity and text accuracy."
Researcher Affiliation | Academia | "1 Institute of Information Engineering, Chinese Academy of Sciences; 2 VCIP & TMCC & DISSec, College of Computer Science, Nankai University; 3 School of Cyber Security, University of Chinese Academy of Sciences. {zengweichao, lizhenhang, yangdongbao}@iie.ac.cn; shuyan9812@gamil.com; yzhou@nankai.edu.cn"
Pseudocode | Yes | "Algorithm 1 Glyph-adaptive Mutual Self-attention. Input: inversion latent z_T^source, reconstruction condition embedding c_source, editing condition embedding c_edit, and target text embedding emb_y. Parameters: time step t, interval τ, intensity parameters λ and µ. Output: denoised latents z_0^source and z_0^edit."
Open Source Code | Yes | "Project page: https://github.com/weichaozeng/TextCtrl"
Open Datasets | Yes | "Training Data. Based on [1, 3], we synthesize 200k paired text images for style disentanglement pre-training and supervised training of TextCtrl, wherein each paired images are rendered with the same styles (i.e. font, size, colour, spatial transformation and background) and different texts, along with the corresponding segmentation mask and background image. Furthermore, a total of 730 fonts are employed to synthesize the visual text images in text glyph structure pre-training. (...) Specifically, we collect 1,280 image pairs with the text label from ICDAR 2013 [48], HierText [49] and MLT 2017 [50]"
Dataset Splits | Yes | "ScenePair consists of 1,280 cropped text image pairs along with original full-size images, enabling both style fidelity assessment and text rendering accuracy evaluation. (...) Text Encoder | ScenePair | ScenePair (Random)"
Hardware Specification | Yes | "TextCtrl is trained on 4 NVIDIA A6000 GPUs and the parameter sizes of each module are provided in Tab. 6. (...) The sampling step is set to T = 50 and the classifier-free guidance scale is set to ω = 2, with 7 seconds to generate an edited image on a single NVIDIA A6000 GPU."
Software Dependencies | No | The paper mentions using pre-trained model checkpoints (e.g., Stable Diffusion [17] V1-5, a vision encoder [57], ABINet [47]) and provides links to their repositories, but it does not specify explicit version numbers for general software dependencies such as Python, PyTorch, or CUDA.
Experiment Setup | Yes | "The training process utilizes a batch size of 256 with a learning rate of 1×10⁻⁵ and a total of 100 epochs. (...) The sampling step is set to T = 50 and the classifier-free guidance scale is set to ω = 2, with 7 seconds to generate an edited image on a single NVIDIA A6000 GPU."
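The mutual self-attention step in Algorithm 1 can be illustrated with a minimal sketch. This is a hypothetical simplification, not the paper's implementation: it blends the editing branch's self-attention with keys/values taken from the reconstruction (source) branch, with `lam` standing in for the intensity parameter λ; the glyph-adaptive modulation of λ by the target text embedding is omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def mutual_self_attention(q_edit, k_src, v_src, k_edit, v_edit, lam=0.5):
    """Sketch of one mutual self-attention layer: the editing branch's
    queries attend both to its own keys/values and to those of the source
    (reconstruction) branch; `lam` plays the role of the intensity
    parameter λ in Algorithm 1 (glyph-adaptive modulation omitted)."""
    d = q_edit.shape[-1]
    attn_src = softmax(q_edit @ k_src.T / np.sqrt(d)) @ v_src    # cross to source branch
    attn_edit = softmax(q_edit @ k_edit.T / np.sqrt(d)) @ v_edit  # ordinary self-attention
    return lam * attn_src + (1 - lam) * attn_edit
```

With `lam = 0` this reduces to plain self-attention on the editing branch; with `lam = 1` the edit fully inherits the source branch's attention output, trading text accuracy for style fidelity.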
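The inference settings quoted above (T = 50 sampling steps, guidance scale ω = 2) can be sketched as a generic classifier-free guidance loop. The `denoiser` callable and the final update rule are placeholders, not the paper's scheduler; only the guidance formula and the quoted hyperparameter values are taken from the excerpt.

```python
import numpy as np

def cfg_guided_eps(eps_cond, eps_uncond, omega=2.0):
    # Classifier-free guidance: extrapolate the conditional prediction
    # away from the unconditional one by the guidance scale omega.
    return eps_uncond + omega * (eps_cond - eps_uncond)

def sample(denoiser, z_T, cond, uncond, T=50, omega=2.0):
    """Toy sampling loop with T steps of classifier-free guidance.
    `denoiser(z, t, c)` is a hypothetical noise-prediction function; the
    update `z - eps / T` is a placeholder for a real scheduler (e.g. DDIM)."""
    z = z_T
    for t in reversed(range(T)):
        eps = cfg_guided_eps(denoiser(z, t, cond), denoiser(z, t, uncond), omega)
        z = z - eps / T  # placeholder update step
    return z
```

At ω = 1 the guided estimate equals the plain conditional prediction; ω = 2, as quoted, pushes the sample further toward the conditioning at each of the 50 steps.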