Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. [K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026].
JarvisArt: Liberating Human Artistic Creativity via an Intelligent Photo Retouching Agent
Authors: Yunlong Lin, Zixu Lin, Kunjie Lin, Jinbin Bai, Panwang Pan, Chenxin Li, Haoyu Chen, Zhongdao Wang, Xinghao Ding, Wenbo Li, Shuicheng Yan
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To evaluate performance, we develop MMArt-Bench, a novel benchmark constructed from real-world user edits. JarvisArt demonstrates user-friendly interaction, superior generalization, and fine-grained control over both global and local adjustments, paving a new avenue for intelligent photo retouching. Notably, it outperforms GPT-4o with a 60% improvement in average pixel-level metrics on MMArt-Bench for content fidelity, while maintaining comparable instruction-following capabilities. Section heading: "4 Experiment". |
| Researcher Affiliation | Collaboration | 1 Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, Xiamen, Fujian, China 2 The Hong Kong University of Science and Technology (Guangzhou) 3 The Chinese University of Hong Kong 4 Bytedance 5 National University of Singapore 6 Tsinghua University |
| Pseudocode | No | The paper describes the workflow of JarvisArt (Sec. 3.1) and the GRPO-R algorithm (Sec. 3.3.2 and Appendix B.1) in textual form, including mathematical formulas, but does not present any structured pseudocode or algorithm blocks with numbered or bulleted steps formatted like code. |
| Open Source Code | No | Code will be released upon acceptance of the paper. |
| Open Datasets | Yes | We source raw images from PPR10K [32], the Adobe Lightroom community, and licensed open-source collections. To assess the generalization ability of our system, we conduct comprehensive qualitative and visual comparisons on the MIT-FiveK [3] benchmark dataset. |
| Dataset Splits | Yes | The CoT supervised fine-tuning phase is performed on 50K CoT-annotated instances from MMArt, with a batch size of 2, a learning rate of 1e-5, and training for 2 epochs using the Llama-Factory framework [74] on 8 A100 (80G) GPUs. The reinforcement learning phase, employing the GRPO-R algorithm, is conducted on 5K standard instruction samples from MMArt, using the veRL framework [50]. For each training step, we sample a batch of 2, a learning rate of 1e-6, and generate 4 responses per query, training for 2 epochs on 16 A100 (80G) GPUs. MMArt-Bench. To provide a comprehensive evaluation of JarvisArt's performance, we introduce the MMArt-Bench, which is sampled from the MMArt dataset. It includes four main scenarios: portrait, landscape, street scenes, and still life, with 50 instances per category, totaling 200 instances. Each primary category contains multiple subcategories (Appendix A.1). For region-level evaluation, we utilize a portrait subset comprising 50 human-centered images with mask annotations. |
| Hardware Specification | Yes | The CoT supervised fine-tuning phase is performed on 50K CoT-annotated instances from MMArt, with a batch size of 2, a learning rate of 1e-5, and training for 2 epochs using the Llama-Factory framework [74] on 8 A100 (80G) GPUs. The reinforcement learning phase, employing the GRPO-R algorithm, is conducted on 5K standard instruction samples from MMArt, using the veRL framework [50]. For each training step, we sample a batch of 2, a learning rate of 1e-6, and generate 4 responses per query, training for 2 epochs on 16 A100 (80G) GPUs. |
| Software Dependencies | Yes | We adopt Qwen2.5-VL-7B-Instruct [1] as the base model for JarvisArt. The CoT supervised fine-tuning phase is performed on 50K CoT-annotated instances from MMArt, with a batch size of 2, a learning rate of 1e-5, and training for 2 epochs using the Llama-Factory framework [74] on 8 A100 (80G) GPUs. The reinforcement learning phase, employing the GRPO-R algorithm, is conducted on 5K standard instruction samples from MMArt, using the veRL framework [50]. |
| Experiment Setup | Yes | The CoT supervised fine-tuning phase is performed on 50K CoT-annotated instances from MMArt, with a batch size of 2, a learning rate of 1e-5, and training for 2 epochs using the Llama-Factory framework [74] on 8 A100 (80G) GPUs. The reinforcement learning phase, employing the GRPO-R algorithm, is conducted on 5K standard instruction samples from MMArt, using the veRL framework [50]. For each training step, we sample a batch of 2, a learning rate of 1e-6, and generate 4 responses per query, training for 2 epochs on 16 A100 (80G) GPUs. To ensure reproducibility, we provide the complete hyperparameter settings for both the SFT and GRPO-R phases in Table 4. |
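
The training settings quoted in the table can be collected into a small structured record, which may help when comparing them against the paper's Table 4. The following is only a minimal sketch: the class name `PhaseConfig`, its field names, and the derived counts are illustrative assumptions and do not come from the paper or its (unreleased) code. The quoted text does not say whether the batch size of 2 is per device or global, so no step counts are derived from it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PhaseConfig:
    """Hypothetical container for one training phase as quoted in the table."""
    name: str
    framework: str
    num_samples: int          # training instances drawn from MMArt
    batch_size: int           # as quoted; per-device vs. global is not stated
    learning_rate: float
    epochs: int
    num_gpus: int             # A100 (80G) GPUs
    responses_per_query: int = 1

    def samples_seen(self) -> int:
        # Instances processed over all epochs (simple product, an assumption).
        return self.num_samples * self.epochs

    def responses_generated(self) -> int:
        # Sampled responses over all epochs, if every query is rolled out
        # responses_per_query times (an assumption about the procedure).
        return self.samples_seen() * self.responses_per_query


# Values copied from the quoted experiment setup.
SFT = PhaseConfig("cot_sft", "Llama-Factory", 50_000, 2, 1e-5, 2, 8)
GRPO_R = PhaseConfig("grpo_r", "veRL", 5_000, 2, 1e-6, 2, 16,
                     responses_per_query=4)

# MMArt-Bench composition as quoted: four scenarios, 50 instances each.
BENCH_CATEGORIES = {
    "portrait": 50,
    "landscape": 50,
    "street scenes": 50,
    "still life": 50,
}


def bench_size(categories: dict[str, int]) -> int:
    """Total number of benchmark instances across all categories."""
    return sum(categories.values())


if __name__ == "__main__":
    for phase in (SFT, GRPO_R):
        print(phase.name, phase.samples_seen(), phase.responses_generated())
    print("MMArt-Bench instances:", bench_size(BENCH_CATEGORIES))
```

The benchmark total computed this way matches the 200 instances stated in the Dataset Splits row; the other derived counts are rough bookkeeping rather than figures reported by the authors.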