Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

SECOND: Mitigating Perceptual Hallucination in Vision-Language Models via Selective and Contrastive Decoding

Authors: Woohyeon Park, Woojin Kim, Jaeik Kim, Jaeyoung Do

ICML 2025 | Venue PDF | LLM Run Details

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | Extensive experiments demonstrate that SECOND outperforms several baselines across diverse benchmarks, including POPE (Li et al., 2023), VQAv2 (Antol et al., 2015), MMStar (Chen et al., 2024a), and MMBench (Liu et al., 2025), highlighting its effectiveness. |
| Researcher Affiliation | Academia | 1 Department of Electrical and Computer Engineering, Seoul National University, Seoul, South Korea; 2 Interdisciplinary Program in Artificial Intelligence, Seoul National University, Seoul, South Korea. Correspondence to: Jaeyoung Do <EMAIL>. |
| Pseudocode | Yes | Appendix D: Patch Selection Algorithm. |
| Open Source Code | Yes | Code is available at https://github.com/AIDASLab/SECOND. |
| Open Datasets | Yes | Extensive experiments demonstrate that SECOND outperforms several baselines across diverse benchmarks, including POPE (Li et al., 2023), VQAv2 (Antol et al., 2015), MMStar (Chen et al., 2024a), and MMBench (Liu et al., 2025). |
| Dataset Splits | Yes | POPE (Li et al., 2023) is a widely adopted benchmark that specializes in identifying perceptual hallucination by querying the presence of specific objects in a given image through simple yes/no questions. It employs recall, accuracy, and F1 score as the primary evaluation metrics and includes 3k questions derived from well-known datasets such as MSCOCO (Lin et al., 2014), A-OKVQA (Schwenk et al., 2022), and GQA (Hudson & Manning, 2019). In this study, we evaluated the models using the popular split of the POPE benchmark. ... For the general tasks, VQAv2 (Antol et al., 2015) serves as a benchmark for evaluating VLMs' ability to generate answers for given image-question pairs. ... We evaluate the lite version consisting of 0.5k questions ... MMStar comprises 1.5k questions, while MMBench's lite version includes 0.5k samples. (A minimal sketch of these metrics follows the table.) |
| Hardware Specification | No | The paper does not specify the hardware used to run its experiments. |
| Software Dependencies | No | The paper does not list ancillary software dependencies with version numbers. |
| Experiment Setup | Yes | For the several hyperparameters in SECOND, we serve the optimal settings in Appendix C, further analyzing the hyperparameter sensitivity in Sec. 5.5. ... Table 6. Optimal settings of patch selection hyperparameter λ. ... Table 7. Optimal settings of multi-stage CD hyperparameters α, β, and γ. (A generic contrastive-decoding sketch follows the table.) |
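
The Dataset Splits evidence describes POPE as simple yes/no presence questions scored with accuracy, recall, and F1. Below is a minimal sketch of how those three metrics are computed for such a split; the function name, the list-of-strings interface, and the choice of "yes" as the positive class are illustrative assumptions, not taken from the paper or the SECOND codebase.

```python
def pope_metrics(preds, golds):
    """Accuracy, recall, and F1 for POPE-style yes/no answers.

    preds, golds: equal-length lists of "yes"/"no" strings.
    "yes" (object is present) is treated as the positive class;
    this convention is an assumption, not stated in the paper.
    """
    assert len(preds) == len(golds)
    tp = sum(p == "yes" and g == "yes" for p, g in zip(preds, golds))
    fp = sum(p == "yes" and g == "no" for p, g in zip(preds, golds))
    fn = sum(p == "no" and g == "yes" for p, g in zip(preds, golds))
    tn = sum(p == "no" and g == "no" for p, g in zip(preds, golds))

    accuracy = (tp + tn) / len(golds)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "recall": recall, "f1": f1}


# Toy example: model answers vs. ground truth for three questions.
print(pope_metrics(["yes", "no", "yes"], ["yes", "yes", "no"]))
# {'accuracy': 0.333..., 'recall': 0.5, 'f1': 0.5}
```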
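
The Experiment Setup row cites multi-stage contrastive decoding (CD) hyperparameters α, β, and γ; their exact combination is specified in the paper's Appendix C, which this report does not reproduce. For background only, the sketch below shows the generic single-step contrastive-decoding update common in the hallucination-mitigation literature: amplify logits conditioned on the full input and subtract logits from a deliberately weaker contrast view. The α here plays an analogous role to SECOND's coefficients, but this is not SECOND's multi-stage formulation.

```python
import torch

def contrastive_logits(logits_full: torch.Tensor,
                       logits_weak: torch.Tensor,
                       alpha: float = 1.0) -> torch.Tensor:
    """Generic single-step contrastive decoding (NOT SECOND's method).

    logits_full: next-token logits conditioned on the full input
                 (e.g., the image with all or selected patches).
    logits_weak: logits from a weaker contrast view
                 (e.g., a distorted or patch-ablated image).
    alpha:       contrast strength; alpha = 0 recovers plain decoding.
    """
    return (1 + alpha) * logits_full - alpha * logits_weak


# Toy usage with dummy vocabulary-sized logit vectors.
vocab_size = 8
scores = contrastive_logits(torch.randn(vocab_size),
                            torch.randn(vocab_size), alpha=0.5)
next_token = scores.argmax().item()
```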