Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

SECOND: Mitigating Perceptual Hallucination in Vision-Language Models via Selective and Contrastive Decoding

Authors: Woohyeon Park, Woojin Kim, Jaeik Kim, Jaeyoung Do

ICML 2025 | Venue PDF | LLM Run Details

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | Extensive experiments demonstrate that SECOND outperforms several baselines across diverse benchmarks, including POPE (Li et al., 2023), VQAv2 (Antol et al., 2015), MMStar (Chen et al., 2024a), and MMBench (Liu et al., 2025), highlighting its effectiveness. |
| Researcher Affiliation | Academia | 1 Department of Electrical and Computer Engineering, Seoul National University, Seoul, South Korea; 2 Interdisciplinary Program in Artificial Intelligence, Seoul National University, Seoul, South Korea. Correspondence to: Jaeyoung Do <EMAIL>. |
| Pseudocode | Yes | Appendix D: Patch Selection Algorithm. |
| Open Source Code | Yes | Code is available at https://github.com/AIDASLab/SECOND. |
| Open Datasets | Yes | Extensive experiments demonstrate that SECOND outperforms several baselines across diverse benchmarks, including POPE (Li et al., 2023), VQAv2 (Antol et al., 2015), MMStar (Chen et al., 2024a), and MMBench (Liu et al., 2025). |
| Dataset Splits | Yes | POPE (Li et al., 2023) is a widely adopted benchmark that specializes in identifying perceptual hallucination by querying the presence of specific objects in a given image through simple yes/no questions. It employs recall, accuracy, and F1 score as the primary evaluation metrics and includes 3k questions derived from well-known datasets such as MSCOCO (Lin et al., 2014), A-OKVQA (Schwenk et al., 2022), and GQA (Hudson & Manning, 2019). In this study, we evaluated the models using the popular split of the POPE benchmark. ... For the general tasks, VQAv2 (Antol et al., 2015) serves as a benchmark for evaluating VLMs' ability to generate answers for given image-question pairs. ... We evaluate the lite version consisting of 0.5k questions ... MMStar comprises 1.5k questions, while MMBench's lite version includes 0.5k samples. (A minimal sketch of these metrics follows the table.) |
| Hardware Specification | No | The paper does not specify the hardware used to run its experiments. |
| Software Dependencies | No | The paper does not list ancillary software dependencies with version numbers. |
| Experiment Setup | Yes | For the several hyperparameters in SECOND, we serve the optimal settings in Appendix C, further analyzing the hyperparameter sensitivity in Sec. 5.5. ... Table 6. Optimal settings of patch selection hyperparameter λ. ... Table 7. Optimal settings of multi-stage CD hyperparameters α, β, and γ. (A generic contrastive-decoding sketch follows the table.) |
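
The Dataset Splits evidence describes POPE as simple yes/no presence questions scored with accuracy, recall, and F1. Below is a minimal sketch of how those three metrics are computed for such a split; the function name, the list-of-strings interface, and the choice of "yes" as the positive class are illustrative assumptions, not taken from the paper or the SECOND codebase.

```python
def pope_metrics(preds, golds):
    """Accuracy, recall, and F1 for POPE-style yes/no answers.

    preds, golds: equal-length lists of "yes"/"no" strings.
    "yes" (object is present) is treated as the positive class;
    this convention is an assumption, not stated in the paper.
    """
    assert len(preds) == len(golds)
    tp = sum(p == "yes" and g == "yes" for p, g in zip(preds, golds))
    fp = sum(p == "yes" and g == "no" for p, g in zip(preds, golds))
    fn = sum(p == "no" and g == "yes" for p, g in zip(preds, golds))
    tn = sum(p == "no" and g == "no" for p, g in zip(preds, golds))

    accuracy = (tp + tn) / len(golds)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "recall": recall, "f1": f1}


# Toy example: model answers vs. ground truth for three questions.
print(pope_metrics(["yes", "no", "yes"], ["yes", "yes", "no"]))
# {'accuracy': 0.333..., 'recall': 0.5, 'f1': 0.5}
```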
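
The Experiment Setup row cites multi-stage contrastive decoding (CD) hyperparameters α, β, and γ; their exact combination is specified in the paper's Appendix C, which this report does not reproduce. For background only, the sketch below shows the generic single-step contrastive-decoding update common in the hallucination-mitigation literature: amplify logits conditioned on the full input and subtract logits from a deliberately weaker contrast view. The α here plays an analogous role to SECOND's coefficients, but this is not SECOND's multi-stage formulation.

```python
import torch

def contrastive_logits(logits_full: torch.Tensor,
                       logits_weak: torch.Tensor,
                       alpha: float = 1.0) -> torch.Tensor:
    """Generic single-step contrastive decoding (NOT SECOND's method).

    logits_full: next-token logits conditioned on the full input
                 (e.g., the image with all or selected patches).
    logits_weak: logits from a weaker contrast view
                 (e.g., a distorted or patch-ablated image).
    alpha:       contrast strength; alpha = 0 recovers plain decoding.
    """
    return (1 + alpha) * logits_full - alpha * logits_weak


# Toy usage with dummy vocabulary-sized logit vectors.
vocab_size = 8
scores = contrastive_logits(torch.randn(vocab_size),
                            torch.randn(vocab_size), alpha=0.5)
next_token = scores.argmax().item()
```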