StrokeNUWA: Tokenizing Strokes for Vector Graphic Synthesis

Authors: Zecheng Tang, Chenfei Wu, Zekai Zhang, Minheng Ni, Shengming Yin, Yu Liu, Zhengyuan Yang, Lijuan Wang, Zicheng Liu, Juntao Li, Nan Duan

ICML 2024

Reproducibility Variable Result LLM Response
Research Type Experimental We compare StrokeNUWA with optimization-based methods in the text-guided Scalable Vector Graphic (SVG) generation task. Our approach achieves higher CLIPScore (Hessel et al., 2021) metrics, suggesting that utilizing stroke tokens can yield content with richer visual semantics. When benchmarked against LLM-based baselines, our method surpasses them across all metrics, indicating that stroke tokens can integrate effectively with LLMs. Besides, StrokeNUWA achieves up to a 94x speedup in inference over prior methods with an exceptional SVG code compression ratio of 6.9%. Our code is available at https://github.com/ProjectNUWA/StrokeNUWA.
Researcher Affiliation Collaboration Soochow University; Microsoft Research Asia; Microsoft Azure AI.
Pseudocode No The paper describes the methodologies in prose and uses diagrams (e.g., Figure 3, Figure 4) to illustrate architecture, but no formal pseudocode or algorithm block is present.
Open Source Code Yes Our code is available at https://github.com/ProjectNUWA/StrokeNUWA.
Open Datasets Yes We construct the training and evaluation data with the FIGR-8-SVG dataset (Clouâtre & Demers, 2019), which consists of massive monochromatic (black-and-white) SVG icons.
Dataset Splits Yes After preprocessing, we sample 2,000 instances with varying SVG code lengths as the testing set, 8,000 samples for validation, and apply the remaining 740K samples for training.
Hardware Specification Yes We test with LIVE (Ma et al., 2022) and VectorFusion (Jain et al., 2023) on one NVIDIA V100 GPU. ... We utilize the DeepSpeed library (Rajbhandari et al., 2020) to implement models on 64 NVIDIA V100 GPUs and set the maximum model length as 512.
Software Dependencies No The paper mentions the "DeepSpeed library (Rajbhandari et al., 2020)" and that they "utilize the 3B Flan-T5 model (Chung et al., 2022) as the backbone" and "use T5 tokenizer". However, it does not provide specific version numbers for these software components (e.g., the DeepSpeed, Flan-T5, or PyTorch version).
Experiment Setup Yes Implementation Details For VQ-Stroke, we set the depth of the residual vector quantization d to 2, corresponding to compression rates of 2 and 4. Then, we set the codebook size |B| as 4096, with each code corresponding to a latent representation of 512 dimensions. We set α = 1 in Eq. 3 during the training process. For EDM, we utilize the 3B Flan-T5 model (Chung et al., 2022) as the backbone. We utilize the DeepSpeed library (Rajbhandari et al., 2020) to implement models on 64 NVIDIA V100 GPUs and set the maximum model length as 512.
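The quoted VQ-Stroke configuration (depth d = 2, codebook size |B| = 4096, 512-dimensional latent codes) can be illustrated with a minimal residual vector quantization sketch. This is not the authors' implementation: the codebooks below are random stand-ins for the learned ones, and the function name `rvq_encode` is hypothetical.

```python
import numpy as np

# Residual vector quantization (RVQ) with the configuration reported for
# VQ-Stroke: depth d = 2, codebook size |B| = 4096, 512-dim latent codes.
# Random codebooks stand in for the trained ones from the paper.
RNG = np.random.default_rng(0)
DEPTH, CODEBOOK_SIZE, LATENT_DIM = 2, 4096, 512
CODEBOOKS = RNG.normal(size=(DEPTH, CODEBOOK_SIZE, LATENT_DIM))

def rvq_encode(z):
    """Return DEPTH code indices and the reconstruction they imply.

    Level 0 quantizes z itself; each deeper level quantizes the residual
    left over by the levels above it.
    """
    codes = []
    quantized = np.zeros_like(z)
    for level in range(DEPTH):
        residual = z - quantized
        # nearest codebook entry (L2 distance) to the current residual
        dists = np.linalg.norm(CODEBOOKS[level] - residual, axis=1)
        idx = int(np.argmin(dists))
        codes.append(idx)
        quantized = quantized + CODEBOOKS[level][idx]
    return codes, quantized

codes, z_hat = rvq_encode(RNG.normal(size=LATENT_DIM))
```

Per the quoted setup, the two RVQ levels correspond to the compression rates of 2 and 4; only the quantization step itself is sketched here, not the stroke encoder/decoder around it.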