Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026.
Q-Bench: A Benchmark for General-Purpose Foundation Models on Low-level Vision
Authors: Haoning Wu, Zicheng Zhang, Erli Zhang, Chaofeng Chen, Liang Liao, Annan Wang, Chunyi Li, Wenxiu Sun, Qiong Yan, Guangtao Zhai, Weisi Lin
ICLR 2024 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To address this gap, we present Q-Bench, a holistic benchmark crafted to systematically evaluate potential abilities of MLLMs on three realms: low-level visual perception, low-level visual description, and overall visual quality assessment. Our evaluation across the three abilities confirms that MLLMs possess preliminary low-level visual skills. |
| Researcher Affiliation | Collaboration | ¹S-Lab, Nanyang Technological University, ²Shanghai Jiaotong University, ³Sensetime |
| Pseudocode | Yes | Algorithm 1 Pytorch-style Pseudo Code for Softmax-based Strategy for IQA with MLLMs |
| Open Source Code | Yes | Project Page: https://q-future.github.io/Q-Bench. |
| Open Datasets | Yes | For the assessment ability (A3), we utilize plenty of existing IQA databases (Hosu et al., 2020; Lin et al., 2019; Li et al., 2023c) that focus on various low-level appearances of images, to benchmark MLLMs within conventional IQA settings. |
| Dataset Splits | Yes | For a holistic examination on the perception ability of MLLMs, we evaluate the multi-choice correctness of MLLMs on different sub-categories of the LLVision dataset, which is equally divided as dev (Tab. 7, will be released) and test (Tab. 2, will keep private) subsets. |
| Hardware Specification | No | No specific hardware details (e.g., GPU/CPU models, memory) used for running experiments were provided. |
| Software Dependencies | No | No specific software dependencies with version numbers (e.g., library names with versions) were explicitly provided. |
| Experiment Setup | Yes | Under this principle, we conduct toy experiments on LLVision QA on Shikra and LLaVA-v1, with two simple instruction strategies: (A) Direct Instruction, in which the prompt is designed as simple as "Rate the quality of the image". (B) Numerical Instruction, in which we specifically instruct numerical ratings, with the prompt: "Score the quality of the image from 1 to 5, with 1 as lowest and 5 as highest." |
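
The Pseudocode row refers to a softmax-based strategy for IQA with MLLMs. The table does not reproduce Algorithm 1, so the following is only a minimal sketch of one plausible reading: the model's next-token logits for two opposing quality words are compared with a two-way softmax, and the probability of the positive word is used as the quality score. The function name `softmax_quality_score`, the token words `good` and `poor`, and the example logits are all assumptions made for illustration, not values taken from the paper.

```python
import math


def softmax_quality_score(logits, good_token="good", poor_token="poor"):
    """Two-way softmax over the logits of two opposing tokens.

    `logits` is a hypothetical mapping from token string to logit value,
    standing in for an MLLM's next-token output. Returns a score in (0, 1).
    """
    x_good = logits[good_token]
    x_poor = logits[poor_token]
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(x_good, x_poor)
    e_good = math.exp(x_good - m)
    e_poor = math.exp(x_poor - m)
    return e_good / (e_good + e_poor)


# Illustrative logits only; other tokens are ignored by the two-way softmax.
example_logits = {"good": 2.0, "poor": 0.5, "the": 1.2}
score = softmax_quality_score(example_logits)
```

A score above 0.5 means the hypothetical model assigns more weight to the positive word than to the negative one. Unlike the numerical instruction prompt quoted in the Experiment Setup row, this kind of readout yields a continuous value without parsing generated text.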