Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Q-Insight: Understanding Image Quality via Visual Reinforcement Learning

Authors: Weiqi Li, Xuanyu Zhang, Shijie Zhao, Yabin Zhang, Junlin Li, Li Zhang, Jian Jun Zhang

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Extensive experiments demonstrate that Q-Insight substantially outperforms existing state-of-the-art methods on both score regression and degradation perception tasks, while exhibiting impressive zero-shot generalization and superior comparison reasoning capability. The paper includes a dedicated "4 Experiments" section, further broken down into "4.2 Score Regression," "4.3 Distortion Perception," and "4.4 Ablation Studies," all of which present empirical results and comparisons.
Researcher Affiliation	Collaboration	1 School of Electronic and Computer Engineering, Peking University; 2 ByteDance Inc. Project lead. Corresponding authors: EMAIL, EMAIL.
Pseudocode	No	The paper describes the methodology of Q-Insight, including the Group Relative Policy Optimization (GRPO) framework, using textual explanations and mathematical equations in Sections 3.1 and 3.2, complemented by an overview diagram in Figure 2. However, it does not contain any explicit pseudocode or algorithm blocks.
Open Source Code	Yes	The code and models are available at https://github.com/bytedance/Q-Insight.
Open Datasets	Yes	For the score regression task, we use diverse IQA datasets across four categories: (a) in-the-wild datasets, including KonIQ [16], SPAQ [11], and LIVE-Wild [12]; (b) synthetic distortion datasets, including KADID [22] and CSIQ [20]; (c) model-processed distortions, including PIPAL [14]; and (d) AI-generated images from AGIQA [21]. ... For degradation perception task, we randomly select 7000 images from DQ-495K [59]... For training, we use the DiffIQA [6] dataset...
Dataset Splits	Yes	Following [58], we split KonIQ into training and test sets, with approximately 7000 training images. ... For degradation perception task, we randomly select 7000 images from DQ-495K [59] that contain a single distortion for training, with an additional 1000 images reserved for testing. ... Specifically, we randomly sample 5k data pairs from the DiffIQA [6] dataset, where each pair is labeled only with comparison results, without any textual descriptions.
Hardware Specification	Yes	Training is completed in approximately one day using 16 NVIDIA A100 GPUs. ... Training is completed in approximately 20 hours using 16 NVIDIA A100 GPUs.
Software Dependencies	No	The paper states: "We employ AdamW [26] as the optimizer" but does not specify any software libraries or frameworks with their version numbers.
Experiment Setup	Yes	In the GRPO algorithm, the generation number N is set to 8, the weight of KL divergence penalty β is set to 1 × 10⁻³, while the weights α1 and α2 are set to 0.25 and 0.75, respectively. The threshold ϵ is set to 0.35. We employ AdamW [26] as the optimizer, using an initial learning rate of 1 × 10⁻⁶ that linearly decays to 1 × 10⁻⁹ during training. The model is trained for 10 epochs with a total batch size of 128.
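
The hyperparameters quoted in the Experiment Setup row can be collected into a small configuration object, which may help when attempting a reproduction. The following is a minimal sketch, not the authors' code: the class name `ReportedSetup`, the field names, and the helpers `linear_lr` and `per_gpu_batch` are hypothetical, and the sketch assumes that the learning rate is interpolated per optimizer step and that the total batch of 128 is split evenly across the 16 GPUs from the Hardware Specification row without gradient accumulation. The paper excerpt confirms none of these implementation details.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReportedSetup:
    """Values quoted in the table above; field names are illustrative."""
    num_generations: int = 8      # GRPO generation number N
    kl_weight: float = 1e-3       # weight of the KL divergence penalty (beta)
    alpha1: float = 0.25          # weight alpha_1
    alpha2: float = 0.75          # weight alpha_2
    threshold: float = 0.35       # threshold epsilon
    lr_start: float = 1e-6        # initial learning rate
    lr_end: float = 1e-9          # final learning rate after linear decay
    epochs: int = 10
    total_batch_size: int = 128
    num_gpus: int = 16            # taken from the Hardware Specification row


def linear_lr(step: int, total_steps: int, cfg: ReportedSetup) -> float:
    """Linearly interpolate from lr_start to lr_end (assumed per-step decay)."""
    if total_steps < 1:
        raise ValueError("total_steps must be at least 1")
    # Clamp so that out-of-range steps return the boundary values.
    frac = min(max(step, 0), total_steps) / total_steps
    return cfg.lr_start + frac * (cfg.lr_end - cfg.lr_start)


def per_gpu_batch(cfg: ReportedSetup) -> int:
    """Per-device batch size, assuming an even split and no accumulation."""
    if cfg.total_batch_size % cfg.num_gpus != 0:
        raise ValueError("total batch size is not divisible by the GPU count")
    return cfg.total_batch_size // cfg.num_gpus
```

Under these assumptions, `per_gpu_batch(ReportedSetup())` gives 8 samples per GPU, and `linear_lr` returns the initial rate at step 0 and the final rate at the last step. The total number of optimizer steps is not stated in the excerpt, so it is left as a parameter.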