Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.

CODE: Contrasting Self-generated Description to Combat Hallucination in Large Multi-modal Models

Authors: Junho Kim, Hyunjun Kim, Kim Yeonju, Yong Man Ro

NeurIPS 2024 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments demonstrate that our method significantly reduces hallucinations and improves cross-modal consistency across various benchmarks and cutting-edge LMMs.
Researcher Affiliation | Academia | Junho Kim, Hyun Jun Kim, Yeon Ju Kim, Yong Man Ro. Integrated Vision and Language Lab, KAIST, South Korea. EMAIL
Pseudocode | Yes | Algorithm 1 COuntering DEscription Contrastive Decoding
Open Source Code | Yes | Code is available at https://ivy-lvlm.github.io/CODE
Open Datasets | Yes | As discriminative benchmarks, we utilize mainly three datasets for detailed evaluation. Specifically, POPE [35] is a commonly used benchmark... MMVP [53] aims to evaluate... Realworld QA [57] is the most recent dataset... We use three benchmarks for generative benchmarks... LLaVA-QA90 [42] and LLaVA-Bench (In-the-Wild) [42]... MMHal-Bench [52] evaluates...
Dataset Splits | Yes | We implement our method on contemporary LMMs: LLaVA-1.5 (13B) [40], Emu2-Chat (14B) [51], InternLM-XComposer2 (7B) [16], LLaVA-NeXT (34B) [41], Yi-VL (34B) [62], and InternVL 1.5 (26B) [8].
Hardware Specification | Yes | We compare three different model sizes on 8 NVIDIA RTX A6000 GPUs as in Table 4. Throughput (token/s) for VCD / OPERA / CODE: 7B [16] 5.62 / 1.23 / 3.66; 14B [51] 4.04 / 1.04 / 2.82; 34B [41] 3.61 / oom / 2.81. Latency (ms/token) for VCD / OPERA / CODE: 7B [16] 177.99 / 809.73 / 272.92; 14B [51] 247.6 / 960.14 / 354.09; 34B [41] 277.27 / oom / 355.81.
Software Dependencies | No | The paper mentions various LMMs (e.g., LLaVA-1.5, Emu2-Chat, InternLM-XComposer2), but does not specify version numbers for general software dependencies like Python, PyTorch, or CUDA.
Experiment Setup | Yes | We used the default parameter settings for all methods, where top-p value 0.95 and temperature 1.0 for Nucleus sampling, the number of window size for searching is 5 (i.e., num-beams 5) for both beam search decoding and OPERA, and CD-α = 1, CD-β = 0.1 for VCD, and k = 0.3 for our method.
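
To make the decoding settings in the Experiment Setup row concrete, here is a minimal Python sketch of how nucleus sampling (top-p 0.95, temperature 1.0) could be combined with a VCD-style contrastive adjustment (CD-α = 1, CD-β = 0.1). The contrastive formula used here, (1 + α) times the original logits minus α times the contrast logits with a β plausibility cutoff, is an assumption based on how VCD is usually formulated; this page does not spell it out. All function names are hypothetical, the logits are toy values, and the k = 0.3 setting of CODE is not modeled because its role is not defined on this page.

```python
import math
import random

# Values quoted in the Experiment Setup row.
TOP_P = 0.95
TEMPERATURE = 1.0
CD_ALPHA = 1.0  # contrast strength reported for VCD
CD_BETA = 0.1   # plausibility cutoff reported for VCD


def softmax(logits, temperature=1.0):
    """Turn logits into probabilities after temperature scaling."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def contrastive_logits(logits_orig, logits_contrast, alpha=CD_ALPHA, beta=CD_BETA):
    """Assumed VCD-style contrast: (1 + alpha) * orig - alpha * contrast.

    Tokens whose original probability falls below beta * max probability
    are masked out with -inf (the plausibility constraint).
    """
    probs = softmax(logits_orig)
    cutoff = beta * max(probs)
    adjusted = []
    for p, lo, lc in zip(probs, logits_orig, logits_contrast):
        if p >= cutoff:
            adjusted.append((1 + alpha) * lo - alpha * lc)
        else:
            adjusted.append(float("-inf"))
    return adjusted


def top_p_filter(probs, top_p=TOP_P):
    """Keep the smallest high-probability set whose mass reaches top_p, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    filtered = [0.0] * len(probs)
    for i in kept:
        filtered[i] = probs[i] / mass
    return filtered


def sample_token(logits_orig, logits_contrast, rng):
    """Sample one token id from the contrast-adjusted, nucleus-filtered distribution."""
    adjusted = contrastive_logits(logits_orig, logits_contrast)
    probs = top_p_filter(softmax(adjusted, TEMPERATURE))
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


if __name__ == "__main__":
    # Toy three-token vocabulary; the logits are made up for illustration.
    rng = random.Random(0)
    print(sample_token([2.0, 1.0, -5.0], [1.0, 1.5, -5.0], rng))
```

In this toy case the third token is removed by the plausibility cutoff before sampling, so only the first two tokens can be drawn. A real pipeline would obtain both sets of logits from the LMM itself.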