AnyText: Multilingual Visual Text Generation and Editing

Authors: Yuxiang Tuo, Wangmeng Xiang, Jun-Yan He, Yifeng Geng, Xuansong Xie

ICLR 2024

Each entry below gives a reproducibility variable, the assessed result, and the LLM response quoted in support.
Research Type: Experimental
LLM response: After conducting extensive evaluation experiments, our method has outperformed all other approaches by a significant margin. Additionally, we contribute the first large-scale multilingual text images dataset, AnyWord-3M, containing 3 million image-text pairs with OCR annotations in multiple languages. Based on AnyWord-3M dataset, we propose AnyText-benchmark for the evaluation of visual text generation accuracy and quality.

Researcher Affiliation: Industry
LLM response: Yuxiang Tuo, Wangmeng Xiang, Jun-Yan He, Yifeng Geng, Xuansong Xie; Institute for Intelligent Computing, Alibaba Group; {yuxiang.tyx,wangmeng.xwm,leyuan.hjy,cangyu.gyf,xingtong.xxs}@alibaba-inc.com

Pseudocode: No
LLM response: The paper does not contain structured pseudocode or algorithm blocks.

Open Source Code: Yes
LLM response: Our project will be open-sourced soon on https://github.com/tyxsspa/AnyText to improve and promote the development of text generation technology.

Open Datasets: Yes
LLM response: Therefore, we propose AnyWord-3M, a large-scale multilingual dataset from publicly available images. The sources of these images include Noah-Wukong (Gu et al., 2022), LAION-400M (Schuhmann et al., 2021), as well as datasets used for OCR recognition tasks such as ArT, COCO-Text, RCTW, LSVT, MLT, MTWI, ReCTS.

Dataset Splits: Yes
LLM response: We randomly extracted 1000 images from both the Wukong and LAION subsets to create the evaluation set called AnyText-benchmark. These two evaluation sets are specifically used to evaluate the accuracy and quality of Chinese and English generation, respectively. The remaining images are used as the training set called AnyWord-3M.

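As one way the quoted holdout could be scripted, here is a minimal sketch assuming per-subset lists of image identifiers; the ID lists, helper name, and fixed seed are hypothetical, not the authors' code:

```python
import random

def make_benchmark_split(image_ids, n_eval=1000, seed=0):
    """Hold out n_eval images for evaluation; the rest go to training."""
    rng = random.Random(seed)  # fixed seed is an assumption; the paper only says "randomly"
    eval_ids = set(rng.sample(image_ids, n_eval))
    train_ids = [i for i in image_ids if i not in eval_ids]
    return train_ids, sorted(eval_ids)

# Placeholder identifier lists standing in for the Wukong and LAION subsets.
wukong_ids = [f"wukong_{i:07d}" for i in range(100_000)]
laion_ids = [f"laion_{i:07d}" for i in range(100_000)]

# One 1000-image evaluation set per subset: Wukong for Chinese, LAION for English.
wukong_train, wukong_eval = make_benchmark_split(wukong_ids)
laion_train, laion_eval = make_benchmark_split(laion_ids)
```
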
Hardware Specification: Yes
LLM response: Our model was trained on the AnyWord-3M dataset for 10 epochs using 8 Tesla A100 GPUs. ... We compared the computational overhead of both models using a batch size of 4 on a single Tesla V100; the inference time for ControlNet is 3476 ms/image, and for AnyText it is 3512 ms/image.

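For context, per-image latency figures like those quoted are typically measured along these lines in PyTorch; this is a sketch with a hypothetical `pipe_fn` inference callable, not the authors' benchmarking code:

```python
import time
import torch

def ms_per_image(pipe_fn, batch, n_warmup=3, n_runs=10):
    """Average wall-clock milliseconds per image for a GPU inference callable."""
    for _ in range(n_warmup):
        pipe_fn(batch)                 # warm up kernels, allocator, and caches
    torch.cuda.synchronize()           # flush queued GPU work before starting the clock
    start = time.perf_counter()
    for _ in range(n_runs):
        pipe_fn(batch)
    torch.cuda.synchronize()           # GPU ops are async; wait before stopping the clock
    elapsed_ms = (time.perf_counter() - start) * 1000
    return elapsed_ms / (n_runs * len(batch))
```
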
Software Dependencies: No
LLM response: The paper mentions specific OCR models (PP-OCRv3, PP-OCRv4) and general frameworks (ControlNet, SD1.5) but does not provide specific version numbers for broader software dependencies like Python, PyTorch, or CUDA.

Experiment Setup: Yes
LLM response: Our model was trained on the AnyWord-3M dataset for 10 epochs using 8 Tesla A100 GPUs. We employed a progressive finetuning strategy, where the editing branch was turned off for the first 5 epochs, then activated with a probability of σ = 0.5 for the next 3 epochs. In the last 2 epochs, we enabled the perceptual loss with a weight coefficient of λ = 0.01. Image dimensions of l_g and l_p are set to be 1024x1024 and 512x512, while e_g, p_g, and p'_g are all set to be 80x512. We use AdamW optimizer with a learning rate of 2e-5 and a batch size of 48. ... All methods employed the DDIM sampler with 20 steps of sampling, a CFG-scale of 9, a fixed random seed of 100, and a batch size of 4.

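As an illustration of the quoted schedule, a minimal per-step switch function, assuming an epoch-indexed training loop; the function name and the behavior of the editing branch in the final two epochs are assumptions, not the released implementation:

```python
import random

def progressive_flags(epoch, total_epochs=10, sigma=0.5, lam=0.01):
    """Per-step switches for the progressive finetuning schedule.

    Epochs 0-4: editing branch off. Epochs 5-7: branch activated per step
    with probability sigma. Last 2 epochs: perceptual loss with weight lam
    (whether the branch stays probabilistic here is an assumption; the
    paper only states when each switch turns on).
    """
    use_editing_branch = epoch >= 5 and random.random() < sigma
    perceptual_weight = lam if epoch >= total_epochs - 2 else 0.0
    return use_editing_branch, perceptual_weight
```

In diffusers-style APIs, the quoted comparison settings would correspond to a DDIM scheduler with `num_inference_steps=20`, `guidance_scale=9.0`, and a generator seeded with 100, applied uniformly across the compared methods.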