GlyphControl: Glyph Conditional Control for Visual Text Generation

Authors: Yukang Yang, Dongnan Gui, Yuhui Yuan, Weicong Liang, Haisong Ding, Han Hu, Kai Chen

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate the effectiveness of our approach by measuring OCR-based metrics, CLIP score, and FID of the generated visual text. Our empirical evaluations demonstrate that GlyphControl outperforms the recent DeepFloyd IF approach in terms of OCR accuracy, CLIP score, and FID, highlighting the efficacy of our method. We conduct thorough experiments and show that our approach consistently achieves much higher OCR accuracy than DeepFloyd IF.
Researcher Affiliation | Collaboration | Yukang Yang (1), Dongnan Gui (2), Yuhui Yuan (3), Weicong Liang (3), Haisong Ding (3), Han Hu (3), Kai Chen (3); (1) Princeton University, (2) University of Science and Technology of China, (3) Microsoft Research Asia
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | Yes | https://github.com/AIGText/GlyphControl-release
Open Datasets | No | To construct this benchmark, we start with LAION-2B-en, a subset of LAION-5B [32], and selectively choose specimens that exhibit abundant visual text content using the PP-OCR engine (see the filtering sketch after the table). As a result, we have curated a high-quality LAION-Glyph dataset consisting of 10 million images. This dataset includes detailed OCR information and captions that are well-formed and accurate. However, the paper does not explicitly state that the constructed LAION-Glyph dataset itself is publicly available, nor does it provide a link for accessing it.
Dataset Splits | No | The paper partitions the LAION-Glyph dataset into three scales (100K, 1M, 10M) for training and uses separate evaluation benchmarks (SimpleBench, CreativeBench) for testing. It does not specify a distinct validation split for hyperparameter tuning or early stopping during training.
Hardware Specification | No | The paper mentions that training the 'stable-diffusion-2-base' model, which the authors used as a foundation, 'costs hundreds of hours with 128 A100 GPUs'. However, it does not provide any specific hardware details (GPU models, CPU, memory, or cluster specifications) for the authors' own experiments or for the training of GlyphControl.
Software Dependencies | Yes | Our training process incorporates PP-OCRv3 [6] as the OCR engine. For rendering glyphs, we leverage the tools available in the ImageDraw module of the Python library Pillow (see the rendering sketch after the table).
Experiment Setup | Yes | We train our framework on three different dataset scales: LAION-Glyph-100K, LAION-Glyph-1M, and LAION-Glyph-10M for 60 epochs, 20 epochs, and 6 epochs, respectively. For both the Glyph ControlNet and Zero-Conv blocks, we set the base learning rate to 1e-4. The U-Net encoder and decoder are both kept frozen during training. The caption dropping rates for the SD branch and the Glyph ControlNet branch are set to 0.1 and 0.5, respectively. The input images are maintained at a resolution of 512 × 512. The scale of classifier-free guidance is set to 9, and we use the empty string as the negative prompt. We use the DDIM [35] sampler with 20 sampling steps (see the sampling sketch after the table).
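
The Open Datasets row above references an OCR-based filtering step used to build LAION-Glyph. Below is a minimal sketch of what such a filter could look like using the paddleocr package (which provides the PP-OCR engine). The word-count and confidence thresholds are illustrative assumptions, not the paper's actual selection criteria, and the result format can vary slightly across PaddleOCR versions.

```python
# Sketch of an OCR-based filter: keep an image only if PP-OCR detects
# enough confidently recognized text. Thresholds are assumptions.
from paddleocr import PaddleOCR

ocr = PaddleOCR(use_angle_cls=True, lang="en")

def has_abundant_text(image_path, min_words=1, min_confidence=0.9):
    """Return True if the image contains confidently recognized text."""
    result = ocr.ocr(image_path, cls=True)
    if not result or result[0] is None:
        return False
    # In PaddleOCR 2.x, each detection is [box, (text, confidence)].
    confident = [det for det in result[0] if det[1][1] >= min_confidence]
    return len(confident) >= min_words
```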
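
For the glyph rendering dependency noted in the Software Dependencies row, here is a minimal sketch using Pillow's ImageDraw module. The font, font size, and centered placement are assumptions; the paper does not specify exact rendering parameters such as text location or typeface.

```python
# Minimal glyph-rendering sketch with Pillow's ImageDraw: black text on a
# white 512x512 canvas as a glyph condition image. Font path is an assumption.
from PIL import Image, ImageDraw, ImageFont

def render_glyph_image(text, width=512, height=512, font_size=64,
                       font_path="DejaVuSans.ttf"):
    """Render `text` as black glyphs centered on a white canvas."""
    image = Image.new("RGB", (width, height), color="white")
    draw = ImageDraw.Draw(image)
    font = ImageFont.truetype(font_path, font_size)
    # Center the text using its bounding box (left, top, right, bottom).
    left, top, right, bottom = draw.textbbox((0, 0), text, font=font)
    x = (width - (right - left)) / 2 - left
    y = (height - (bottom - top)) / 2 - top
    draw.text((x, y), text, fill="black", font=font)
    return image

if __name__ == "__main__":
    render_glyph_image("Hello World").save("glyph_condition.png")
```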
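
The Experiment Setup row specifies a DDIM sampler with 20 steps, a classifier-free guidance scale of 9, and the empty string as the negative prompt. The sketch below configures those sampler settings with the Hugging Face diffusers library on the underlying stable-diffusion-2-base model. GlyphControl's glyph branch is not part of stock diffusers, so this illustrates only the sampling configuration, not the full method.

```python
# Sketch of the stated sampling settings (DDIM, 20 steps, CFG scale 9,
# empty negative prompt) on stable-diffusion-2-base via diffusers.
import torch
from diffusers import StableDiffusionPipeline, DDIMScheduler

pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-base", torch_dtype=torch.float16
).to("cuda")
# Swap in the DDIM sampler, matching the paper's choice.
pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config)

image = pipe(
    prompt='A sign that says "Hello World"',
    negative_prompt="",       # empty string as the negative prompt
    num_inference_steps=20,   # 20 DDIM sampling steps
    guidance_scale=9.0,       # classifier-free guidance scale of 9
    height=512, width=512,    # 512 x 512 resolution, as in the paper
).images[0]
image.save("sample.png")
```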