Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
CODE: Contrasting Self-generated Description to Combat Hallucination in Large Multi-modal Models
Authors: Junho Kim, Hyunjun Kim, Kim Yeonju, Yong Man Ro
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments demonstrate that our method significantly reduces hallucinations and improves cross-modal consistency across various benchmarks and cutting-edge LMMs. |
| Researcher Affiliation | Academia | Junho Kim, Hyun Jun Kim, Yeon Ju Kim, Yong Man Ro; Integrated Vision and Language Lab, KAIST, South Korea; EMAIL |
| Pseudocode | Yes | Algorithm 1 COuntering DEscription Contrastive Decoding |
| Open Source Code | Yes | Code is available at https://ivy-lvlm.github.io/CODE |
| Open Datasets | Yes | As discriminative benchmarks, we utilize mainly three datasets for detailed evaluation. Specifically, POPE [35] is a commonly used benchmark... MMVP [53] aims to evaluate... Realworld QA [57] is the most recent dataset... We use three benchmarks for generative benchmarks... LLaVA-QA90 [42] and LLaVA-Bench (In-the-Wild) [42]... MMHal-Bench [52] evaluates... |
| Dataset Splits | Yes | We implement our method on contemporary LMMs: LLaVA-1.5 (13B) [40], Emu2-Chat (14B) [51], InternLM-XComposer2 (7B) [16], LLaVA-NeXT (34B) [41], Yi-VL (34B) [62], and InternVL 1.5 (26B) [8]. |
| Hardware Specification | Yes | We compare three different model sizes on 8 NVIDIA RTX A6000 GPUs as in Table 4. Throughput (token/s), VCD / OPERA / CODE: 7B [16] 5.62 / 1.23 / 3.66; 14B [51] 4.04 / 1.04 / 2.82; 34B [41] 3.61 / oom / 2.81. Latency (ms/token), VCD / OPERA / CODE: 7B [16] 177.99 / 809.73 / 272.92; 14B [51] 247.6 / 960.14 / 354.09; 34B [41] 277.27 / oom / 355.81. |
| Software Dependencies | No | The paper mentions various LMMs (e.g., LLaVA-1.5, Emu2-Chat, InternLM-XComposer2), but does not specify version numbers for general software dependencies like Python, PyTorch, or CUDA. |
| Experiment Setup | Yes | We used the default parameter settings for all methods, where top-p value 0.95 and temperature 1.0 for Nucleus sampling, the number of window size for searching is 5 (i.e., num-beams 5) for both beam search decoding and OPERA, and CD-α = 1, CD-β = 0.1 for VCD, and k = 0.3 for our method. |
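
The decoding settings quoted in the Experiment Setup row can be gathered in one place, as in the minimal Python sketch below. It is an illustration only, not the authors' code: the dictionary keys and function names are hypothetical, and the contrastive step follows the commonly stated VCD-style formulation (amplify the full-input logits by 1 + α, subtract α times the contrast logits, and keep only tokens whose full-input probability is at least β times the maximum), which is assumed here rather than taken from the paper. The paper's own method and its k = 0.3 parameter are not reproduced.

```python
import math

# Values quoted in the Experiment Setup row. The key names are hypothetical
# and do not come from the paper's released code.
DECODING_SETTINGS = {
    "nucleus": {"top_p": 0.95, "temperature": 1.0},
    "beam_search": {"num_beams": 5},
    "opera": {"num_beams": 5},
    "vcd": {"cd_alpha": 1.0, "cd_beta": 0.1},
    "code": {"k": 0.3},  # recorded only; its use is not sketched here
}


def softmax(logits, temperature=1.0):
    """Convert logits to probabilities after temperature scaling."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def nucleus_filter(probs, top_p):
    """Keep the smallest set of most probable tokens whose cumulative
    mass reaches top_p, then renormalize over that set."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    return {i: probs[i] / mass for i in kept}


def contrastive_logits(logits_full, logits_contrast, alpha, beta):
    """Assumed VCD-style contrast: (1 + alpha) * full - alpha * contrast,
    with tokens below beta * max(p_full) masked out."""
    p_full = softmax(logits_full)
    cutoff = beta * max(p_full)
    return [
        (1 + alpha) * f - alpha * c if p >= cutoff else float("-inf")
        for f, c, p in zip(logits_full, logits_contrast, p_full)
    ]


# Toy four-token vocabulary; the logits are made up for illustration.
vcd = DECODING_SETTINGS["vcd"]
nucleus = DECODING_SETTINGS["nucleus"]
adjusted = contrastive_logits(
    [2.0, 1.0, 0.0, -3.0], [2.0, 0.0, 0.0, -3.0],
    vcd["cd_alpha"], vcd["cd_beta"],
)
candidates = nucleus_filter(
    softmax(adjusted, nucleus["temperature"]), nucleus["top_p"]
)
```

Here `candidates` maps the surviving token indices to renormalized probabilities, from which a sampler would draw the next token. Beam search and OPERA, which the row configures with `num_beams` 5, are not sketched.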