CODE: Contrasting Self-generated Description to Combat Hallucination in Large Multi-modal Models

Authors: Junho Kim, Hyunjun Kim, Yeonju Kim, Yong Man Ro

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Extensive experiments demonstrate that our method significantly reduces hallucinations and improves cross-modal consistency across various benchmarks and cutting-edge LMMs."
Researcher Affiliation | Academia | "Junho Kim, Hyunjun Kim, Yeonju Kim, Yong Man Ro. Integrated Vision and Language Lab, KAIST, South Korea. {arkimjh, kimhj709, yeonju7.kim, ymro}@kaist.ac.kr"
Pseudocode | Yes | "Algorithm 1: COuntering DEscription Contrastive Decoding"
Open Source Code | Yes | "Code is available at https://ivy-lvlm.github.io/CODE"
Open Datasets | Yes | "As discriminative benchmarks, we utilize mainly three datasets for detailed evaluation. Specifically, POPE [35] is a commonly used benchmark... MMVP [53] aims to evaluate... RealWorldQA [57] is the most recent dataset... We use three benchmarks for generative benchmarks... LLaVA-QA90 [42] and LLaVA-Bench (In-the-Wild) [42]... MMHal-Bench [52] evaluates..."
Dataset Splits | Yes | "We implement our method on contemporary LMMs: LLaVA-1.5 (13B) [40], Emu2-Chat (14B) [51], InternLM-XComposer2 (7B) [16], LLaVA-NeXT (34B) [41], Yi-VL (34B) [62], and InternVL 1.5 (26B) [8]."
Hardware Specification | Yes | "We compare three different model sizes on 8 NVIDIA RTX A6000 GPUs as in Table 4."

    Model    | Throughput (token/s)  | Latency (ms/token)
             | VCD   OPERA  CODE     | VCD     OPERA   CODE
    7B  [16] | 5.62  1.23   3.66     | 177.99  809.73  272.92
    14B [51] | 4.04  1.04   2.82     | 247.60  960.14  354.09
    34B [41] | 3.61  OOM    2.81     | 277.27  OOM     355.81
Software Dependencies | No | The paper mentions various LMMs (e.g., LLaVA-1.5, Emu2-Chat, InternLM-XComposer2), but does not specify version numbers for general software dependencies such as Python, PyTorch, or CUDA.
Experiment Setup | Yes | "We used the default parameter settings for all methods: top-p = 0.95 and temperature = 1.0 for nucleus sampling; a search window size of 5 (i.e., num-beams = 5) for both beam-search decoding and OPERA; CD-α = 1 and CD-β = 0.1 for VCD; and k = 0.3 for our method."
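The pseudocode row above refers to Algorithm 1, a contrastive decoding scheme that contrasts logits conditioned on the image against logits conditioned on the model's self-generated description. The paper's exact algorithm is not reproduced here; the sketch below shows only the generic contrastive-decoding step it builds on. The function name, the `alpha` contrast weight, and the 0.1 plausibility cutoff (mirroring the CD-β value quoted for VCD) are all assumptions for illustration.

```python
import numpy as np

def contrastive_decode_step(logits_visual, logits_description, alpha=1.0):
    """One generic contrastive-decoding step (sketch, not the paper's exact Algorithm 1).

    logits_visual: next-token logits conditioned on the image.
    logits_description: next-token logits conditioned on the model's
        self-generated textual description instead of the image.
    alpha: contrast strength (hypothetical parameter name).
    """
    # Amplify tokens the visual branch prefers over the description branch.
    contrasted = (1 + alpha) * logits_visual - alpha * logits_description
    # Adaptive plausibility constraint: keep only tokens whose visual-branch
    # probability is at least a fraction (0.1, assumed) of the top token's.
    probs = np.exp(logits_visual - logits_visual.max())
    probs /= probs.sum()
    mask = probs >= 0.1 * probs.max()
    contrasted = np.where(mask, contrasted, -np.inf)
    return int(np.argmax(contrasted))
```

The plausibility mask prevents the contrast term from promoting implausible tokens that the description branch merely happens to dislike.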
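The experiment-setup row fixes nucleus sampling at top-p = 0.95 and temperature = 1.0. As a reference for what those two knobs do, here is a minimal standalone sketch of nucleus (top-p) sampling; the function name and signature are assumptions, not the paper's implementation.

```python
import numpy as np

def nucleus_sample(logits, top_p=0.95, temperature=1.0, rng=None):
    """Nucleus (top-p) sampling sketch with the defaults quoted above."""
    if rng is None:
        rng = np.random.default_rng()
    logits = np.asarray(logits, dtype=float) / temperature
    # Softmax with the usual max-subtraction for numerical stability.
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    # Keep the smallest set of tokens whose cumulative probability >= top_p.
    order = np.argsort(probs)[::-1]
    cumulative = np.cumsum(probs[order])
    cutoff = np.searchsorted(cumulative, top_p) + 1
    keep = order[:cutoff]
    kept = probs[keep] / probs[keep].sum()
    return int(rng.choice(keep, p=kept))
```

With top_p = 0.95 the tail of unlikely tokens is truncated before sampling, while temperature = 1.0 leaves the model's distribution unscaled.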
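The hardware row reports throughput in token/s and latency in ms/token. For anyone re-running the comparison, the two metrics can be derived from a single timed generation loop; the sketch below is an assumption about the measurement, not the paper's harness, and `generate_fn` is a hypothetical stand-in for a model's decoding call.

```python
import time

def measure_decoding(generate_fn, prompt, n_runs=3):
    """Time a generation call and report Table-4-style metrics (sketch).

    generate_fn(prompt) is assumed to return the list of generated tokens.
    """
    total_tokens, total_seconds = 0, 0.0
    for _ in range(n_runs):
        start = time.perf_counter()
        tokens = generate_fn(prompt)
        total_seconds += time.perf_counter() - start
        total_tokens += len(tokens)
    throughput = total_tokens / total_seconds           # token/s
    latency_ms = 1000.0 * total_seconds / total_tokens  # ms/token
    return throughput, latency_ms
```

Note the two numbers are reciprocals up to the ms conversion, which is why VCD's higher throughput in the table corresponds to its lower latency.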