Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026.
Rendering-Aware Reinforcement Learning for Vector Graphics Generation
Authors: Juan Rodriguez, Haotian Zhang, Abhay Puri, Rishav Pramanik, Aarash Feizi, Pascal Wichmann, Arnab Mondal, Mohammad R. Samsami, Rabiul Awal, Perouz Taslakian, Spandana Gella, Sai Rajeswar Mudumba, David Vazquez, Chris Pal, Marco Pedersoli
NeurIPS 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The main paper focuses primarily on the Im2SVG experiments, as this setting offers a well-defined and visually grounded framework for evaluating the SVG performance gains achieved with RLRF. |
| Researcher Affiliation | Collaboration | 1 ServiceNow Research, 2 Mila, 3 ÉTS Montréal, 4 Polytechnique Montréal, 5 Columbia University, 6 Independent Scholar, 7 Stony Brook University, 8 Apple, 9 Google Research, 10 Canada CIFAR AI Chair, 11 McGill University |
| Pseudocode | No | The paper describes methods and equations for training and rewards but does not include any explicit pseudocode or algorithm blocks. |
| Open Source Code | No | We are not submitting the data, code, or model weights as part of this submission. However, we are actively working towards open-sourcing all relevant materials, including the dataset, codebase, and pretrained model weights. We plan to release these resources publicly at the time of the camera-ready version to support full reproducibility of our main experimental results. |
| Open Datasets | Yes | We fine-tune Qwen2.5-VL models (3B and 7B) on the Im2SVG task using a cleaned subset of 1.7M image-SVG pairs from the SVG-Stack dataset [Rodriguez et al., 2025b], resulting in the SVG-SFT models. For Text2SVG, we use Qwen3-8B, a text-only model, and train it using image caption datasets (Flickr30k and MM-Icons), using only the captions as inputs (no SVG supervision). The model is prompted to `<think>` before generating SVG code. We train on a single node with 8 A100 GPUs for 4 days, using a rollout size of 16 and a batch size of 32 per step, for 1000 steps, corresponding to 16k unique captions. Across all RLRF experiments, we use a learning rate of 1e-5 with 70% decay every 100 steps. KL regularization is disabled (KL coefficient = 0), with a clipping threshold ϵ = 0.4 and sampling temperature set to 1.1. For Text2SVG evaluation, we use the MM-Icon and MM-Illustration [Yang et al., 2025b], as well as Flickr30k Captions [Young et al., 2014]. |
| Dataset Splits | Yes | We fine-tune Qwen2.5-VL models (3B and 7B) on the Im2SVG task using a cleaned subset of 1.7M image-SVG pairs from the SVG-Stack dataset [Rodriguez et al., 2025b]... We begin by filtering the SVG-Stack dataset to select 20k high-entropy samples that are rich in visual detail and SVG complexity (each with over 500 tokens). Details of this data curation process are provided in Appendix B.2. During training, we use the GRPO algorithm with a rollout batch size of 32 images per step. For each image, 64 rollouts are generated, resulting in 2,048 rollouts per training step. We train for 500 steps in total, covering 16k unique images, significantly fewer than the 1.7M samples used in SVG-SFT. Training was completed in approximately 3 days using 4 nodes, each with 8 H100 GPUs. For Text2SVG, we use Qwen3-8B, a text-only model, and train it using image caption datasets (Flickr30k and MM-Icons), using only the captions as inputs (no SVG supervision). The model is prompted to `<think>` before generating SVG code. We train on a single node with 8 A100 GPUs for 4 days, using a rollout size of 16 and a batch size of 32 per step, for 1000 steps, corresponding to 16k unique captions. Across all RLRF experiments, we use a learning rate of 1e-5 with 70% decay every 100 steps. KL regularization is disabled (KL coefficient = 0), with a clipping threshold ϵ = 0.4 and sampling temperature set to 1.1. For Im2SVG, we report results on SVG-Stack-Hard, a curated subset of 500 visually complex and diverse SVGs selected from the original SVG-Stack (see Appendix B.2 for more details on datasets). |
| Hardware Specification | Yes | Training runs used 4×8 H100 GPUs (3B model) or 8×8 H100 GPUs (7B model) for 4 days over 3 epochs, with learning rate 1e-5, batch size 1024, and context length 32k tokens. Although Qwen2.5-VL supports up to 128k tokens, we limit it to 32k to fit 90% of the data given memory constraints. Reinforcement Learning from Rendering Feedback (RLRF) on SVGs: For the Im2SVG task, we further post-train the Qwen2.5-VL models, as well as StarVector-1B (which can be viewed as an SVG-SFT model), using RLRF. We begin by filtering the SVG-Stack dataset to select 20k high-entropy samples that are rich in visual detail and SVG complexity (each with over 500 tokens). Details of this data curation process are provided in Appendix B.2. During training, we use the GRPO algorithm with a rollout batch size of 32 images per step. For each image, 64 rollouts are generated, resulting in 2,048 rollouts per training step. We train for 500 steps in total, covering 16k unique images, significantly fewer than the 1.7M samples used in SVG-SFT. Training was completed in approximately 3 days using 4 nodes, each with 8 H100 GPUs. For Text2SVG, we use Qwen3-8B, a text-only model, and train it using image caption datasets (Flickr30k and MM-Icons), using only the captions as inputs (no SVG supervision). The model is prompted to `<think>` before generating SVG code. We train on a single node with 8 A100 GPUs for 4 days, using a rollout size of 16 and a batch size of 32 per step, for 1000 steps, corresponding to 16k unique captions. |
| Software Dependencies | No | We use the LLaMA-Factory codebase [Zheng et al., 2024] to conduct our supervised fine-tuning (SFT) experiments. For reinforcement learning, including our GRPO-based approach, we build on EasyR1 [Zheng et al., 2025] and VERL [Sheng et al., 2024]. We leverage vLLM [Kwon et al., 2023] for sampling during rollout generation, as it enables highly optimized decoding with high throughput and low latency. This is particularly important for SVG generation, which involves long context sequences. |
| Experiment Setup | Yes | Training runs used 4×8 H100 GPUs (3B model) or 8×8 H100 GPUs (7B model) for 4 days over 3 epochs, with learning rate 1e-5, batch size 1024, and context length 32k tokens. ... Across all RLRF experiments, we use a learning rate of 1e-5 with 70% decay every 100 steps. KL regularization is disabled (KL coefficient = 0), with a clipping threshold ϵ = 0.4 and sampling temperature set to 1.1. |
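
The Im2SVG rollout counts and the learning-rate schedule quoted in the table can be checked with a few lines of arithmetic. The sketch below is illustrative only and is not the authors' code: the names `RLRFBudget` and `step_decay_lr` are hypothetical, and it assumes that "70% decay every 100 steps" means the learning rate is multiplied by 0.7 at each 100-step boundary. The quoted text could also be read as a 70% reduction, in which case the factor would be 0.3.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RLRFBudget:
    """Rollout budget for one RL run (hypothetical helper, not from the paper)."""

    prompts_per_step: int     # images (or captions) sampled per training step
    rollouts_per_prompt: int  # SVG samples generated for each prompt
    steps: int                # total number of training steps

    @property
    def rollouts_per_step(self) -> int:
        return self.prompts_per_step * self.rollouts_per_prompt

    @property
    def prompts_seen(self) -> int:
        # Upper bound on unique prompts: assumes no prompt is ever repeated.
        return self.prompts_per_step * self.steps


def step_decay_lr(step: int, base_lr: float = 1e-5,
                  factor: float = 0.7, every: int = 100) -> float:
    """Step-decayed learning rate.

    ASSUMPTION: the rate is multiplied by `factor` once per `every` steps.
    Use factor=0.3 if "70% decay" is meant as a 70% reduction instead.
    """
    return base_lr * factor ** (step // every)


# Im2SVG values quoted in the table: 32 images per step, 64 rollouts each, 500 steps.
im2svg = RLRFBudget(prompts_per_step=32, rollouts_per_prompt=64, steps=500)
print(im2svg.rollouts_per_step)  # 2048, matching "2,048 rollouts per training step"
print(im2svg.prompts_seen)       # 16000, matching "16k unique images"
print(step_decay_lr(250))        # learning rate after two decay boundaries
```

Under the 0.7 reading, the learning rate stays at 1e-5 for steps 0 to 99 and drops to 7e-6 at step 100.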