Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Flex-Judge: Text-Only Reasoning Unleashes Zero-Shot Multimodal Evaluators

Authors: Jongwoo Ko, Sungnyun Kim, Sungwoo Cho, Se-Young Yun

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirical results demonstrate that FLEX-Judge, despite being trained on significantly fewer text data, achieves competitive or superior performance compared to state-of-the-art commercial APIs and extensively trained multimodal evaluators. Notably, FLEX-Judge presents broad impact in modalities like molecule, where comprehensive evaluation benchmarks are scarce, underscoring its practical value in resource-constrained domains.
Researcher Affiliation | Collaboration | Jongwoo Ko¹, Sungnyun Kim², Sungwoo Cho², Se-Young Yun²; ¹Microsoft, ²KAIST AI
Pseudocode | No | The paper does not contain any sections or figures explicitly labeled as "Pseudocode" or "Algorithm", nor does it present structured code blocks.
Open Source Code | Yes | We have added our source code and dataset in the supplementary material. We also plan to make all the assets available upon acceptance.
Open Datasets | Yes | We comprehensively evaluate FLEX-Judge across diverse modalities, including images, videos, and audio, demonstrating its generalization capability and competitive performance against state-of-the-art judge models. Notably, it matches closed-source commercial APIs on vision tasks and outperforms all training-free evaluators in audio understanding. These strong results suggest that FLEX-Judge can be used with confidence in modalities even when expert judge models are not applicable, which we further explore in Section 4. Evaluation Benchmarks. We evaluate image understanding capabilities of FLEX-Judge using the MLLM-as-a-Judge benchmark [11], which comprises 14 diverse vision-language tasks including captioning and website browsing, and the VL-RewardBench benchmark [38], which focuses on complex reasoning tasks like visual hallucination detection. For image generation assessment, we use MJ-Bench [14] to assess image quality and alignment. We use GenAI-Bench [35] for evaluating video generation and image editing. For audio understanding, following the prior work [68], we conduct speech quality assessment task, specifically performing mean opinion score (MOS) prediction by using the NISQA [49], BVCC [18], and SOMOS [46] datasets, and speaker similarity score (SS) prediction with VoxSim [2] dataset for assessing speaker similarity score (SS). For additional results of Multimodal RewardBench [82] and JudgeAnything [54] benchmarks, refer to Appendix C.1. We also provide the language-only assessment results in Appendix C.4.
Dataset Splits | Yes | Audio MOS/SS Benchmark: There is no unified, structured benchmark for audio evaluation tasks. Instead, Wang et al. [68] assessed speech quality and speaker similarity using four datasets: NISQA [49], BVCC [18], and SOMOS [46] for speech quality (712, 742, and 3,000 test samples, respectively), and VoxSim [2] for speaker similarity (2,776 test pairs). All datasets include human-annotated scores. For the MOS prediction task, auditory LLMs are asked to assign the MOS score on a scale from 1.0 to 5.0 for a given speech input. For each query x, we sample two responses from Mol-LLaMA using different decoding temperatures, 0.8 and 1.2. We use FLEX-Mol-LLaMA to compare the two and include the example in the DPO training set only if the response from temperature 0.8 receives a higher score than the one from 1.2. Also, since there is a position bias [66] in pairwise comparisons, we flip the order of the two responses in prompt and evaluate again, retaining only those pairs where the winning response remains consistent. Consequently, we construct 4,253 high-quality preference triplets (x, y_w, y_l).
Hardware Specification | Yes | Training is conducted on 2 NVIDIA A6000 GPUs, taking approximately 1.5 hours per run, which highlights cost-efficiency of our FLEX-Judge. For FLEX-Mol-LLaMA, we use the same hyperparameters as for FLEX-VL-7B.
Software Dependencies | No | The paper mentions specific models like "Qwen2.5-Omni-7B" and "Qwen2.5-VL-7B" as backbones but does not provide explicit version numbers for other ancillary software or libraries (e.g., Python, PyTorch, CUDA versions).
Experiment Setup | Yes | Using a 1K-sized training dataset, we fine-tune Qwen2.5-VL-7B and Qwen2.5-Omni-7B with learning rates of 1e-5 and 7e-6, respectively. For both models, we use a batch size of 2 and a maximum sequence length of 4096 for a single epoch. Training is conducted on 2 NVIDIA A6000 GPUs, taking approximately 1.5 hours per run, which highlights cost-efficiency of our FLEX-Judge. For FLEX-Mol-LLaMA, we use the same hyperparameters as for FLEX-VL-7B.
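
The preference-pair filter quoted in the Dataset Splits row (keep a pair only if the temperature-0.8 response wins, and still wins after the two responses swap positions) can be illustrated with a minimal sketch. This is not the authors' code: the `Judge` interface, `build_preference_triplets`, and the deliberately position-biased `toy_judge` are hypothetical stand-ins for the real judge model, introduced only to show the control flow.

```python
from typing import Callable, Iterable, List, Tuple

# Hypothetical judge interface (an assumption, not the paper's API):
# given a query and two responses in prompt order, return one score per response.
Judge = Callable[[str, str, str], Tuple[float, float]]


def build_preference_triplets(
    samples: Iterable[Tuple[str, str, str]],
    judge: Judge,
) -> List[Tuple[str, str, str]]:
    """Return (x, y_w, y_l) triplets that survive both filtering steps.

    Each sample is (x, y_low, y_high), where y_low was sampled at the lower
    temperature (0.8 in the quoted text) and y_high at the higher one (1.2).
    """
    triplets = []
    for x, y_low, y_high in samples:
        # Step 1: the low-temperature response must receive the higher score.
        s_low, s_high = judge(x, y_low, y_high)
        if s_low <= s_high:
            continue
        # Step 2: swap the prompt order and require the same winner,
        # which screens out wins caused only by position bias.
        s_high_flipped, s_low_flipped = judge(x, y_high, y_low)
        if s_low_flipped <= s_high_flipped:
            continue
        triplets.append((x, y_low, y_high))
    return triplets


def toy_judge(x: str, first: str, second: str) -> Tuple[float, float]:
    """Toy scorer: response length, plus a bonus for whichever comes first."""
    return len(first) + 1.0, float(len(second))


samples = [
    ("q1", "a detailed answer", "short"),  # wins in both orders: kept
    ("q2", "alpha", "gamma"),              # wins only when shown first: dropped
    ("q3", "tiny", "a much longer reply"), # loses outright: dropped
]
triplets = build_preference_triplets(samples, toy_judge)
print(triplets)
```

With the toy judge, only the first sample survives: the second wins solely because of the first-position bonus, so the flipped evaluation rejects it. In practice the judge call would wrap the actual evaluator model, and ties could be handled differently than the strict comparison assumed here.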