TextCtrl: Diffusion-based Scene Text Editing with Prior Guidance Control
Authors: Weichao Zeng, Yan Shu, Zhenhang Li, Dongbao Yang, Yu Zhou
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments demonstrate the effectiveness of TextCtrl compared with previous methods concerning both style fidelity and text accuracy. |
| Researcher Affiliation | Academia | 1 Institute of Information Engineering, Chinese Academy of Sciences 2 VCIP & TMCC & DISSec, College of Computer Science, Nankai University 3 School of Cyber Security, University of Chinese Academy of Sciences {zengweichao, lizhenhang, yangdongbao}@iie.ac.cn shuyan9812@gamil.com, yzhou@nankai.edu.cn |
| Pseudocode | Yes | Algorithm 1 Glyph-adaptive Mutual Self-attention. Input: inversion latent z_T^source, reconstruction condition embedding c_source, editing condition embedding c_edit, and target text embedding emb_y. Parameters: time step t, interval τ, intensity parameters λ and µ. Output: denoised latents z_0^source and z_0^edit. |
| Open Source Code | Yes | Project page: https://github.com/weichaozeng/TextCtrl |
| Open Datasets | Yes | Training Data. Based on [1, 3], we synthesize 200k paired text images for style disentanglement pre-training and supervised training of TextCtrl, wherein each image pair is rendered with the same style (i.e. font, size, colour, spatial transformation and background) and different texts, along with the corresponding segmentation mask and background image. Furthermore, a total of 730 fonts are employed to synthesize the visual text images in text glyph structure pre-training. (...) Specifically, we collect 1,280 image pairs with the text label from ICDAR 2013 [48], HierText [49] and MLT 2017 [50]. |
| Dataset Splits | Yes | ScenePair consists of 1,280 cropped text image pairs along with original full-size images, enabling both style fidelity assessment and text rendering accuracy evaluation. |
| Hardware Specification | Yes | TextCtrl is trained on 4 NVIDIA A6000 GPUs, and the parameter sizes of each module are provided in Tab. 6. (...) The sampling step is set to T = 50 and the classifier-free guidance scale is set to ω = 2, taking 7 seconds to generate an edited image on a single NVIDIA A6000 GPU. |
| Software Dependencies | No | The paper mentions pre-trained model checkpoints (e.g., Stable Diffusion [17] v1-5, a vision encoder [57], ABINet [47]) and provides links to their repositories, but it does not specify explicit version numbers for general software dependencies such as Python, PyTorch, or CUDA. |
| Experiment Setup | Yes | The training process utilizes a batch size of 256 with a learning rate of 1×10⁻⁵ and a total of 100 epochs. (...) The sampling step is set to T = 50 and the classifier-free guidance scale is set to ω = 2, taking 7 seconds to generate an edited image on a single NVIDIA A6000 GPU. |
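The sampling settings reported above (T = 50 steps, guidance scale ω = 2) follow the standard classifier-free guidance scheme. The sketch below illustrates that scheme only; `denoise` is a hypothetical stand-in for the diffusion model's noise predictor, and the Euler-style update is a simplification, not the paper's actual sampler.

```python
import numpy as np

def cfg_sample(denoise, z, c_edit, c_null, T=50, omega=2.0):
    """Toy classifier-free-guidance sampling loop.

    denoise(z, c, t) is a hypothetical noise-prediction function;
    T and omega match the values reported in the paper (50 and 2).
    """
    for t in range(T, 0, -1):
        eps_cond = denoise(z, c_edit, t)    # conditional prediction
        eps_uncond = denoise(z, c_null, t)  # unconditional prediction
        # Guidance: extrapolate from the unconditional prediction
        # toward the conditional one by a factor of omega.
        eps = eps_uncond + omega * (eps_cond - eps_uncond)
        z = z - eps / T                     # crude Euler-style update
    return z
```

With ω = 1 the update reduces to ordinary conditional sampling; ω = 2 doubles the weight on the conditioning direction, trading some diversity for fidelity to the editing condition.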