CLIPVG: Text-Guided Image Manipulation Using Differentiable Vector Graphics

Authors: Yiren Song, Xuning Shao, Kang Chen, Weidong Zhang, Zhongliang Jing, Minzhe Li

AAAI 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments, Experiment Setup, Implementation: The multi-round vectorization strategy of CLIPVG requires an arbitrary vectorization tool, e.g., AIT (Ado 2022), Diffvg (Li et al. 2020b), LIVE (Ma et al. 2022), etc. We use AIT (Ado 2022) as the default tool, since it gives the most accurate reconstruction results in our experiments. We adopt two rounds of vectorization by default. The first round is done with Nc = 10 and the second with Nc = 30, where Nc is the number of target colors in AIT. We add another round of vectorization for the human face region with Nc = 30. We apply random cropping to obtain Npatch = 64 patches from each ROI; the patches are randomly cropped in each iteration. The default CLIP loss weight is 30.0 for a text-prompt-associated ROI, and 80.0/Npatch for each randomly cropped patch. The patch size is always set to 80% of the longer edge of the ROI, e.g., 400×400 for a 500×300 ROI, and zero-padding is adopted when necessary (a patch-sampling sketch is given after the table). Similar to CLIPstyler (Kwon and Ye 2021), we also apply random perspective augmentation to the patches. Similar to (Kwon and Ye 2021; Patashnik et al. 2021), we use the ViT-B/32 CLIP model (Radford et al. 2021). We employ the Adam (Kingma and Ba 2014) optimizer with a learning rate of 0.2 for the shape parameters and 0.01 for the color parameters by default. The number of iterations is set to 150. The running time information is included in the supplementary material.
Researcher Affiliation | Collaboration | (1) Shanghai Jiao Tong University, Shanghai, China; (2) Netease Games AI Lab, Hangzhou, China
Pseudocode | No | No pseudocode or algorithm blocks are included in the paper.
Open Source Code | Yes | We implement a flexible text-guided image manipulation system that supports a variety of controls far beyond the ability of all existing methods, and the source code of this system will be made publicly available.
Open Datasets | Yes | trained on ImageNet (Deng et al. 2009).
Dataset Splits | No | We use images with a resolution of 512×512 as the inputs. (Explanation: The paper does not specify exact percentages, absolute sample counts, or reference predefined splits for training, validation, and test sets. It mentions image resolution and patch sizes, but not how the overall dataset was split.)
Hardware Specification | No | No specific hardware details (e.g., GPU/CPU models, memory) used for running the experiments are provided in the paper.
Software Dependencies | Yes | We consider two domain-agnostic baselines, Disco Diffusion v5.6 (Dis 2022) and CLIPstyler (Kwon and Ye 2021). We employ the Adam (Kingma and Ba 2014) optimizer with a learning rate of 0.2 for the shape parameters and 0.01 for the color parameters by default. The number of iterations is set to 150.
Experiment Setup | Yes | We apply random cropping to obtain Npatch = 64 patches from each ROI. The default CLIP loss weight is 30.0 for a text-prompt-associated ROI, and 80.0/Npatch for each randomly cropped patch. The patch size is always set to 80% of the longer edge of the ROI, e.g., 400×400 for a 500×300 ROI, and zero-padding is adopted when necessary. We employ the Adam (Kingma and Ba 2014) optimizer with a learning rate of 0.2 for the shape parameters and 0.01 for the color parameters by default. The number of iterations is set to 150. (Sketches of the patch sampling and the optimizer loop follow the table.)
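
The patch-level CLIP supervision quoted above (64 random crops per ROI, patch side equal to 80% of the longer ROI edge with zero-padding, random perspective augmentation, weights 30.0 per ROI and 80.0/Npatch per patch) can be illustrated with the rough PyTorch sketch below. This is a minimal reconstruction under stated assumptions, not the authors' released code: `roi_img` is assumed to be an already rendered ROI tensor, and a plain cosine-distance CLIP loss stands in for the paper's actual objective.

```python
import torch
import torch.nn.functional as F
from torchvision import transforms
import clip  # OpenAI CLIP package

N_PATCH = 64                    # random patches per ROI
ROI_WEIGHT = 30.0               # CLIP loss weight for the whole ROI
PATCH_WEIGHT = 80.0 / N_PATCH   # weight for each randomly cropped patch

perspective = transforms.RandomPerspective(distortion_scale=0.5, p=1.0)
model, _ = clip.load("ViT-B/32", device="cpu")  # CPU keeps the model in fp32 for simplicity
with torch.no_grad():
    text_features = model.encode_text(clip.tokenize(["a fauvism style painting"]))

def sample_patches(roi_img: torch.Tensor, n_patches: int = N_PATCH) -> torch.Tensor:
    """Crop square patches whose side is 80% of the ROI's longer edge,
    zero-padding when the square exceeds the shorter edge
    (e.g. 400x400 patches from a 500x300 ROI)."""
    _, h, w = roi_img.shape
    side = int(0.8 * max(h, w))
    padded = F.pad(roi_img, (0, max(0, side - w), 0, max(0, side - h)))  # zero-pad right/bottom
    _, ph, pw = padded.shape
    patches = []
    for _ in range(n_patches):
        top = torch.randint(0, ph - side + 1, (1,)).item()
        left = torch.randint(0, pw - side + 1, (1,)).item()
        patches.append(perspective(padded[:, top:top + side, left:left + side]))
    return torch.stack(patches)

def clip_loss(images: torch.Tensor, text_feat: torch.Tensor) -> torch.Tensor:
    """Stand-in cosine-distance CLIP loss (CLIPVG's actual loss differs);
    CLIP's pixel normalization is omitted for brevity."""
    images = F.interpolate(images, size=224, mode="bilinear", align_corners=False)
    img_feat = model.encode_image(images)
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
    txt = text_feat / text_feat.norm(dim=-1, keepdim=True)
    return 1.0 - (img_feat * txt).sum(dim=-1)

def roi_loss(roi_img: torch.Tensor) -> torch.Tensor:
    """ROI-level term plus weighted per-patch terms."""
    loss = ROI_WEIGHT * clip_loss(roi_img.unsqueeze(0), text_features).mean()
    loss = loss + PATCH_WEIGHT * clip_loss(sample_patches(roi_img), text_features).sum()
    return loss
```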
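
The optimizer configuration (Adam with a 0.2 learning rate for shape parameters, 0.01 for color parameters, 150 iterations) maps directly onto PyTorch parameter groups. A minimal sketch, assuming `shape_params` and `color_params` are the differentiable path points and fill colors exposed by a diffvg-style renderer, and `render` is a hypothetical differentiable rasterizer; `roi_loss` is reused from the sketch above:

```python
import torch

# `shape_params` / `color_params`: lists of leaf tensors with requires_grad=True,
# e.g. path control points and RGBA fills of the vectorized image (assumed here,
# not taken from released code). `render` is a hypothetical differentiable rasterizer.

optimizer = torch.optim.Adam([
    {"params": shape_params, "lr": 0.2},   # shape (path point) parameters
    {"params": color_params, "lr": 0.01},  # color parameters
])

for step in range(150):                     # 150 iterations by default
    optimizer.zero_grad()
    loss = roi_loss(render(shape_params, color_params))
    loss.backward()
    optimizer.step()
```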