StrokeNUWA: Tokenizing Strokes for Vector Graphic Synthesis
Authors: Zecheng Tang, Chenfei Wu, Zekai Zhang, Minheng Ni, Shengming Yin, Yu Liu, Zhengyuan Yang, Lijuan Wang, Zicheng Liu, Juntao Li, Nan Duan
ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We compare StrokeNUWA with optimization-based methods in the text-guided Scalable Vector Graphic (SVG) generation task. Our approach achieves higher CLIPScore (Hessel et al., 2021) metrics, suggesting that utilizing stroke tokens can yield content with richer visual semantics. When benchmarked against LLM-based baselines, our method surpasses them across all metrics, indicating that stroke tokens can integrate effectively with LLMs. Besides, StrokeNUWA achieves up to a 94× speedup in inference over the speed of prior methods with an exceptional SVG code compression ratio of 6.9%. Our code is available at https://github.com/ProjectNUWA/StrokeNUWA. |
| Researcher Affiliation | Collaboration | 1Soochow University 2Microsoft Research Asia 3Microsoft Azure AI. |
| Pseudocode | No | The paper describes the methodologies in prose and uses diagrams (e.g., Figure 3, Figure 4) to illustrate architecture, but no formal pseudocode or algorithm block is present. |
| Open Source Code | Yes | Our code is available at https://github.com/ProjectNUWA/StrokeNUWA. |
| Open Datasets | Yes | We construct the training and evaluation data with FIGR-8-SVG dataset (Clouâtre & Demers, 2019), which consists of massive monochromatic (black-and-white) SVG icons. |
| Dataset Splits | Yes | After preprocessing, we sample 2,000 instances with varying SVG code lengths as the testing set, 8,000 samples for validation, and apply the remaining 740K samples for training. |
| Hardware Specification | Yes | We test with LIVE (Ma et al., 2022) and VectorFusion (Jain et al., 2023) on one NVIDIA V100 GPU. ... We utilize DeepSpeed library (Rajbhandari et al., 2020) to implement models on 64 NVIDIA V100 GPUs and set the maximum model length as 512. |
| Software Dependencies | No | The paper mentions the "DeepSpeed library (Rajbhandari et al., 2020)" and states that they "utilize the 3B Flan-T5 model (Chung et al., 2022) as the backbone" and "use T5 tokenizer". However, it does not provide specific version numbers for these software components (e.g., DeepSpeed version, Flan-T5 checkpoint revision, or PyTorch version). |
| Experiment Setup | Yes | Implementation Details For VQ-Stroke, we set the depth of the residual vector quantization d to 2, corresponding to compression rates of 2 and 4. Then, we set the codebook size \|B\| as 4096, with each code corresponding to a latent representation of 512 dimensions. We set α = 1 in Eq. 3 during the training process. For EDM, we utilize the 3B Flan-T5 model (Chung et al., 2022) as the backbone. We utilize DeepSpeed library (Rajbhandari et al., 2020) to implement models on 64 NVIDIA V100 GPUs and set the maximum model length as 512. |
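
The Research Type row quotes CLIPScore (Hessel et al., 2021) as the paper's text–SVG alignment metric. A minimal sketch of the reference-free CLIPScore formula, assuming precomputed CLIP embeddings of the rendered SVG and its prompt (the toy vectors below are placeholders, not real CLIP features; the `w = 2.5` rescaling constant comes from the CLIPScore paper):

```python
import numpy as np

def clipscore(image_emb: np.ndarray, text_emb: np.ndarray, w: float = 2.5) -> float:
    """Reference-free CLIPScore: w * max(cos(image, text), 0).

    Both arguments are assumed to be CLIP embeddings of the rendered
    image and its caption; w = 2.5 follows Hessel et al. (2021).
    """
    cos = float(image_emb @ text_emb
                / (np.linalg.norm(image_emb) * np.linalg.norm(text_emb)))
    return w * max(cos, 0.0)

# Toy vectors standing in for real CLIP embeddings.
v = np.array([1.0, 0.0, 0.0])
print(clipscore(v, v))                           # perfectly aligned pair -> 2.5
print(clipscore(v, np.array([0.0, 1.0, 0.0])))   # orthogonal pair -> 0.0
```

Clamping the cosine at zero means mismatched pairs bottom out at 0 rather than going negative, which is why reported CLIPScore values fall in [0, 2.5].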
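
The Experiment Setup row reports the VQ-Stroke quantizer configuration: residual depth d = 2, codebook size |B| = 4096, and 512-dimensional latent codes. A minimal sketch of residual vector quantization under those settings, with randomly initialized codebooks standing in for the paper's learned ones (the function name and demo data are illustrative, not taken from the released code):

```python
import numpy as np

rng = np.random.default_rng(0)

# Settings quoted in the review: residual depth d = 2,
# codebook size |B| = 4096, latent dimension 512.
DEPTH, CODEBOOK_SIZE, LATENT_DIM = 2, 4096, 512

# Random codebooks stand in for learned ones (one per residual level).
codebooks = [rng.standard_normal((CODEBOOK_SIZE, LATENT_DIM))
             for _ in range(DEPTH)]

def residual_quantize(z, codebooks):
    """Quantize latents z of shape (n, dim) with residual VQ.

    Each level snaps the current residual to its nearest codebook
    entry, accumulates that entry into the reconstruction, and passes
    the remaining residual down to the next level.
    """
    residual = z.copy()
    recon = np.zeros_like(z)
    indices = []
    for book in codebooks:
        # Nearest-neighbour lookup by squared Euclidean distance.
        d2 = ((residual[:, None, :] - book[None, :, :]) ** 2).sum(-1)
        idx = d2.argmin(axis=1)
        recon += book[idx]
        residual -= book[idx]
        indices.append(idx)
    return indices, recon

# Each latent is now described by DEPTH integer code indices instead
# of LATENT_DIM floats, which is where the code compression comes from.
z = rng.standard_normal((2, LATENT_DIM))
indices, recon = residual_quantize(z, codebooks)
```

With learned codebooks the per-level residuals shrink, so two levels trade a small reconstruction error for a drastically shorter token sequence, consistent with the compression rates of 2 and 4 quoted above.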