TextDiffuser: Diffusion Models as Text Painters

Authors: Jingye Chen, Yupan Huang, Tengchao Lv, Lei Cui, Qifeng Chen, Furu Wei

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Through experiments and user studies, we show that TextDiffuser is flexible and controllable to create high-quality text images using text prompts alone or together with text template images, and to conduct text inpainting to reconstruct incomplete images with text. The code, model, and dataset will be available at https://aka.ms/textdiffuser.
Researcher Affiliation | Collaboration | Jingye Chen (1,3), Yupan Huang (2,3), Tengchao Lv (3), Lei Cui (3), Qifeng Chen (1), Furu Wei (3); 1: HKUST, 2: Sun Yat-sen University, 3: Microsoft Research
Pseudocode | No | The paper describes its method in prose and flowcharts but does not include any formal pseudocode or algorithm blocks.
Open Source Code | Yes | The code, model, and dataset will be available at https://aka.ms/textdiffuser.
Open Datasets | Yes | To train our model, we use OCR tools and design filtering strategies to obtain 10 million high-quality image-text pairs with OCR annotations (dubbed MARIO-10M), each with recognition, detection, and character-level segmentation annotations. The code, model, and dataset will be available at https://aka.ms/textdiffuser.
Dataset Splits | No | The paper states: 'The total size of MARIO-10M is 10,061,720, from which we randomly chose 10,000,000 samples as the training set and 61,720 as the testing set.' Training and testing sets are thus defined, but no dedicated validation set is mentioned or sized.
Hardware Specification | Yes | We set the batch size to 768 and trained the model for two epochs, taking four days using 8 Tesla V100 GPUs with 32GB memory.
Software Dependencies | No | The paper mentions software such as Hugging Face Diffusers and xformers and specific model checkpoints such as runwayml/stable-diffusion-v1-5, but it does not pin version numbers for these dependencies (e.g., Hugging Face Diffusers vX.Y.Z).
Experiment Setup | Yes | For the first stage, we utilize the pre-trained CLIP [60] to obtain the embedding of given prompts. The number of Transformer layers l is set to 2, and the dimension of the latent space d is set to 512. The maximum token length L is set to 77 following CLIP [60]. We leverage the commonly used font Arial.ttf with font size 24 to obtain the width embedding, and also use this font for rendering. The alphabet A comprises 95 characters: 26 uppercase letters, 26 lowercase letters, 10 digits, 32 punctuation marks, and a space character. After tokenization, only the first subtoken is marked as the keyword when a word splits into several subtokens. For the second stage, we implement the diffusion process with Hugging Face Diffusers [82] and load the checkpoint runwayml/stable-diffusion-v1-5. Notably, we only need to modify the input dimension of the input convolution layer (from 4 to 17), so our model keeps a similar parameter count and computational cost as the original model (see the sketches after this table). The height H and width W of input and output images are 512; for the diffusion process, the latent input has spatial dimensions 64 x 64. We set the batch size to 768 and train the model for two epochs, taking four days on 8 Tesla V100 GPUs with 32GB memory. We use the AdamW optimizer [47] with a learning rate of 1e-5, and employ gradient checkpointing [10] and xformers [39] for computational efficiency. During training, we follow [25] in setting the maximum time step Tmax to 1,000, and the caption is dropped with a probability of 10% for classifier-free guidance [27]. When training the part-image generation branch, the detected text box is masked with a likelihood of 50%. We use 50 sampling steps during inference and classifier-free guidance with a scale of 7.5 following [67].
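
To make the reported channel change concrete, here is a minimal sketch of how the input convolution of the Stable Diffusion v1.5 UNet could be widened from 4 to 17 channels in Hugging Face Diffusers. The zero-initialization of the 13 new channels and their exact layout (latent plus character-segmentation and inpainting features) are assumptions on our part; the paper only states the dimension change.

```python
import torch
from diffusers import UNet2DConditionModel

# Load the UNet from the checkpoint named in the paper.
unet = UNet2DConditionModel.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="unet"
)

# Build a replacement input convolution with 17 input channels.
old = unet.conv_in
new = torch.nn.Conv2d(
    in_channels=17,
    out_channels=old.out_channels,
    kernel_size=old.kernel_size,
    stride=old.stride,
    padding=old.padding,
)

with torch.no_grad():
    new.weight.zero_()
    # Keep the pre-trained weights for the original 4 latent channels;
    # the 13 extra conditioning channels start at zero (our assumption).
    new.weight[:, :4] = old.weight
    new.bias.copy_(old.bias)

unet.conv_in = new
unet.register_to_config(in_channels=17)  # keep the model config consistent
```

Zero-initializing the new channels is a common choice when widening a pre-trained layer, since the modified network initially behaves like the original one.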
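The width embedding in the layout stage depends on rendering keywords with Arial at font size 24. Below is a minimal sketch of measuring a keyword's rendered pixel width with Pillow; the font file path is an assumption and must point to an Arial font on your system.

```python
from PIL import ImageFont

# Arial at size 24, as specified for the width embedding and rendering.
font = ImageFont.truetype("Arial.ttf", size=24)

def keyword_width(word: str) -> int:
    """Pixel width of `word` when rendered, used to derive the width embedding."""
    # getbbox returns (left, top, right, bottom) of the rendered text.
    left, _, right, _ = font.getbbox(word)
    return right - left

print(keyword_width("TextDiffuser"))
```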
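The reported inference settings (50 sampling steps, classifier-free guidance scale 7.5) map directly onto the standard Diffusers sampling arguments. The sketch below shows them on the base Stable Diffusion pipeline only; TextDiffuser's own pipeline additionally conditions on the layout channels, which this snippet does not reproduce.

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

image = pipe(
    "a poster of the text 'Hello World'",  # example prompt, not from the paper
    num_inference_steps=50,  # 50 sampling steps, as reported
    guidance_scale=7.5,      # classifier-free guidance scale following [67]
).images[0]
image.save("sample.png")
```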