Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

JarvisArt: Liberating Human Artistic Creativity via an Intelligent Photo Retouching Agent

Authors: Yunlong Lin, Zixu Lin, Kunjie Lin, Jinbin Bai, Panwang Pan, Chenxin Li, Haoyu Chen, Zhongdao Wang, Xinghao Ding, Wenbo Li, Shuicheng Yan

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	To evaluate performance, we develop MMArt-Bench, a novel benchmark constructed from real-world user edits. JarvisArt demonstrates user-friendly interaction, superior generalization, and fine-grained control over both global and local adjustments, paving a new avenue for intelligent photo retouching. Notably, it outperforms GPT-4o with a 60% improvement in average pixel-level metrics on MMArt-Bench for content fidelity, while maintaining comparable instruction-following capabilities. 4 Experiment
Researcher Affiliation	Collaboration	1 Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, Xiamen, Fujian, China 2 The Hong Kong University of Science and Technology (Guangzhou) 3 The Chinese University of Hong Kong 4 Bytedance 5 National University of Singapore 6 Tsinghua University
Pseudocode	No	The paper describes the workflow of JarvisArt (Sec. 3.1) and the GRPO-R algorithm (Sec. 3.3.2 and Appendix B.1) in textual form, including mathematical formulas, but does not present any structured pseudocode or algorithm blocks with numbered or bulleted steps formatted like code.
Open Source Code	No	Code will be released upon acceptance of the paper.
Open Datasets	Yes	We source raw images from PPR10K [32], the Adobe Lightroom community, and licensed open-source collections. To assess the generalization ability of our system, we conduct comprehensive qualitative and visual comparisons on the MIT-FiveK [3] benchmark dataset.
Dataset Splits	Yes	The CoT supervised fine-tuning phase is performed on 50K CoT-annotated instances from MMArt, with a batch size of 2, a learning rate of 1e-5, and training for 2 epochs using the Llama-Factory framework [74] on 8 A100 (80G) GPUs. The reinforcement learning phase, employing the GRPO-R algorithm, is conducted on 5K standard instruction samples from MMArt, using the veRL framework [50]. For each training step, we sample a batch of 2, a learning rate of 1e-6, and generate 4 responses per query, training for 2 epochs on 16 A100 (80G) GPUs. MMArt-Bench. To provide a comprehensive evaluation of JarvisArt's performance, we introduce the MMArt-Bench, which is sampled from the MMArt dataset. It includes four main scenarios: portrait, landscape, street scenes, and still life, with 50 instances per category, totaling 200 instances. Each primary category contains multiple subcategories (Appendix A.1). For region-level evaluation, we utilize a portrait subset comprising 50 human-centered images with mask annotations.
Hardware Specification	Yes	The CoT supervised fine-tuning phase is performed on 50K CoT-annotated instances from MMArt, with a batch size of 2, a learning rate of 1e-5, and training for 2 epochs using the Llama-Factory framework [74] on 8 A100 (80G) GPUs. The reinforcement learning phase, employing the GRPO-R algorithm, is conducted on 5K standard instruction samples from MMArt, using the veRL framework [50]. For each training step, we sample a batch of 2, a learning rate of 1e-6, and generate 4 responses per query, training for 2 epochs on 16 A100 (80G) GPUs.
Software Dependencies	Yes	We adopt Qwen2.5-VL-7B-Instruct [1] as the base model for JarvisArt. The CoT supervised fine-tuning phase is performed on 50K CoT-annotated instances from MMArt, with a batch size of 2, a learning rate of 1e-5, and training for 2 epochs using the Llama-Factory framework [74] on 8 A100 (80G) GPUs. The reinforcement learning phase, employing the GRPO-R algorithm, is conducted on 5K standard instruction samples from MMArt, using the veRL framework [50].
Experiment Setup	Yes	The CoT supervised fine-tuning phase is performed on 50K CoT-annotated instances from MMArt, with a batch size of 2, a learning rate of 1e-5, and training for 2 epochs using the Llama-Factory framework [74] on 8 A100 (80G) GPUs. The reinforcement learning phase, employing the GRPO-R algorithm, is conducted on 5K standard instruction samples from MMArt, using the veRL framework [50]. For each training step, we sample a batch of 2, a learning rate of 1e-6, and generate 4 responses per query, training for 2 epochs on 16 A100 (80G) GPUs. To ensure reproducibility, we provide the complete hyperparameter settings for both the SFT and GRPO-R phases in Table 4.
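
For readers who want the quoted settings in machine-readable form, the following is a minimal sketch that collects the two training phases and the MMArt-Bench composition reported in the table above. It is not code from the paper: the names `PhaseConfig`, `bench_size`, and `responses_per_step` are hypothetical, the default of one response per query for the SFT phase is only a placeholder, and the sketch assumes that "a batch of 2" in the GRPO-R phase means two queries per training step (the excerpt does not say whether this is per device or global).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PhaseConfig:
    """Settings for one training phase, copied from the quoted excerpt."""
    name: str
    framework: str
    num_samples: int       # training instances drawn from MMArt
    batch_size: int
    learning_rate: float
    epochs: int
    num_gpus: int          # A100 (80G) GPUs
    responses_per_query: int = 1  # placeholder; only GRPO-R reports a value


# Values as reported in the table; field names are illustrative.
SFT = PhaseConfig("CoT SFT", "Llama-Factory", 50_000, 2, 1e-5, 2, 8)
GRPO_R = PhaseConfig("GRPO-R", "veRL", 5_000, 2, 1e-6, 2, 16,
                     responses_per_query=4)

# MMArt-Bench: four scenarios with 50 instances each.
BENCH_SCENARIOS = {
    "portrait": 50,
    "landscape": 50,
    "street scenes": 50,
    "still life": 50,
}


def bench_size(scenarios: dict) -> int:
    """Total number of benchmark instances across all scenarios."""
    return sum(scenarios.values())


def responses_per_step(cfg: PhaseConfig) -> int:
    """Sampled responses per step, ASSUMING batch_size counts queries per step."""
    return cfg.batch_size * cfg.responses_per_query


if __name__ == "__main__":
    print("MMArt-Bench instances:", bench_size(BENCH_SCENARIOS))
    print("GRPO-R responses per step (assumed):", responses_per_step(GRPO_R))
```

Under these assumptions the benchmark total matches the 200 instances stated in the excerpt; the complete hyperparameter list remains the paper's Table 4, which is not reproduced here.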