How Control Information Influences Multilingual Text Image Generation and Editing?

Authors: Boqiang Zhang, Zuan Gao, Yadong Qu, Hongtao Xie

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our method achieves state-of-the-art performance in both Chinese and English text generation.
Researcher Affiliation | Academia | Boqiang Zhang, Zuan Gao, Yadong Qu, Hongtao Xie, University of Science and Technology of China; {cyril,zuangao,qqqyd}@mail.ustc.edu.cn, htxie@ustc.edu.cn
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code | Yes | The code and dataset are available at https://github.com/CyrilSterling/TextGen.
Open Datasets | Yes | Therefore, we introduce TG2M, a multilingual dataset sourced from publicly available images including MARIO-10M [5], Wukong [8], TextSeg [31], ArT [6], LSVT [27], MLT [16], ReCTS [37]. The code and dataset are available at https://github.com/CyrilSterling/TextGen.
Dataset Splits | No | The paper does not provide specific dataset split information (e.g., percentages, sample counts, or detailed splitting methodology) for training, validation, and testing.
Hardware Specification | Yes | We train our model on the TG2M dataset using 8 NVIDIA A40 GPUs with a batch size of 176.
Software Dependencies | No | The paper mentions 'SD1.5' and 'diffusers' (with a GitHub link for diffusers), but does not provide specific version numbers for key software dependencies such as PyTorch, Python, or CUDA for full reproducibility. A hedged loading sketch is given after this table.
Experiment Setup | Yes | We train our model on the TG2M dataset using 8 NVIDIA A40 GPUs with a batch size of 176. Our model converges rapidly and requires only 5 epochs of training. The learning rate is set to 1e-5. During inference, the Fourier balance factors α, β, and s are set to 1.4, 1.2, and 0.2, respectively. These values are collected in the configuration sketch after this table.
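Because the paper names only SD1.5 and the diffusers library without pinning versions, the following is a minimal sketch of how the backbone could be loaded. The checkpoint identifier runwayml/stable-diffusion-v1-5 and the use of StableDiffusionPipeline are assumptions based on the standard Hugging Face diffusers API, not details confirmed by the paper.

    # Minimal sketch (not from the paper): loading an SD1.5 backbone with the
    # Hugging Face diffusers library. Checkpoint id and dtype are assumptions.
    import torch
    from diffusers import StableDiffusionPipeline

    model_id = "runwayml/stable-diffusion-v1-5"  # assumed SD1.5 checkpoint
    pipe = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16)
    pipe = pipe.to("cuda")

    # Generate a sample image containing rendered text.
    image = pipe("a shop sign that reads 'TextGen'").images[0]
    image.save("sample.png")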
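The hyperparameters reported in the Experiment Setup row can be summarized in a small configuration sketch. The field names are illustrative assumptions; only the numeric values come from the paper.

    # Hypothetical configuration collecting the reported hyperparameters.
    # Field names are illustrative; values are those stated in the paper.
    train_config = {
        "dataset": "TG2M",
        "gpus": 8,             # NVIDIA A40
        "batch_size": 176,
        "epochs": 5,
        "learning_rate": 1e-5,
    }
    inference_config = {
        # Fourier balance factors alpha, beta, s used during inference.
        "alpha": 1.4,
        "beta": 1.2,
        "s": 0.2,
    }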