CODE: Contrasting Self-generated Description to Combat Hallucination in Large Multi-modal Models
Authors: Junho Kim, Hyun Jun Kim, Yeon Ju Kim, Yong Man Ro
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments demonstrate that our method significantly reduces hallucinations and improves cross-modal consistency across various benchmarks and cutting-edge LMMs. |
| Researcher Affiliation | Academia | Junho Kim, Hyun Jun Kim, Yeon Ju Kim, Yong Man Ro; Integrated Vision and Language Lab, KAIST, South Korea; {arkimjh, kimhj709, yeonju7.kim, ymro}@kaist.ac.kr |
| Pseudocode | Yes | Algorithm 1 COuntering DEscription Contrastive Decoding (a hedged sketch of such a decoding step follows the table) |
| Open Source Code | Yes | Code is available at https://ivy-lvlm.github.io/CODE |
| Open Datasets | Yes | As discriminative benchmarks, we mainly utilize three datasets for detailed evaluation. Specifically, POPE [35] is a commonly used benchmark... MMVP [53] aims to evaluate... RealWorldQA [57] is the most recent dataset... For generative evaluation, we use three benchmarks... LLaVA-QA90 [42] and LLaVA-Bench (In-the-Wild) [42]... MMHal-Bench [52] evaluates... |
| Dataset Splits | Yes | We implement our method on contemporary LMMs: LLaVA-1.5 (13B) [40], Emu2-Chat (14B) [51], InternLM-XComposer2 (7B) [16], LLaVA-NeXT (34B) [41], Yi-VL (34B) [62], and InternVL 1.5 (26B) [8]. |
| Hardware Specification | Yes | We compare three different model sizes on 8 NVIDIA RTX A6000 GPUs as in Table 4. Throughput (token/s) / Latency (ms/token): 7B [16]: VCD 5.62 / 177.99, OPERA 1.23 / 809.73, CODE 3.66 / 272.92; 14B [51]: VCD 4.04 / 247.6, OPERA 1.04 / 960.14, CODE 2.82 / 354.09; 34B [41]: VCD 3.61 / 277.27, OPERA OOM, CODE 2.81 / 355.81. |
| Software Dependencies | No | The paper mentions various LMMs (e.g., LLaVA-1.5, Emu2-Chat, InternLM-XComposer2) but does not specify version numbers for general software dependencies such as Python, PyTorch, or CUDA. |
| Experiment Setup | Yes | We used the default parameter settings for all methods: top-p 0.95 and temperature 1.0 for nucleus sampling; a search window size of 5 (i.e., num-beams = 5) for both beam search decoding and OPERA; CD-α = 1 and CD-β = 0.1 for VCD; and k = 0.3 for our method (these defaults are collected in the sketch after the table). |
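The Pseudocode row quotes Algorithm 1 (COuntering DEscription Contrastive Decoding). As a reading aid, here is a minimal sketch of one decoding step in the common VCD-style contrastive formulation, with the model's self-generated description standing in as the contrasting condition. The function name, `alpha`, and the reuse of the quoted k = 0.3 as a plausibility cutoff are assumptions, not the paper's exact adaptive information constraint; the released code at https://ivy-lvlm.github.io/CODE is the authoritative version.

```python
import torch
import torch.nn.functional as F

def description_contrastive_step(logits_image, logits_description, alpha=1.0, k=0.3):
    """One hypothetical decoding step that contrasts image-conditioned logits
    against logits conditioned on the model's self-generated description.

    logits_image:       (batch, vocab) logits given the input image + prompt
    logits_description: (batch, vocab) logits given the self-generated description + prompt
    alpha:              contrastive strength (assumed, not taken from the paper)
    k:                  plausibility cutoff, reusing the quoted k = 0.3 as an assumption
    """
    # Amplify tokens supported by the visual input and suppress tokens that the
    # text-only description already predicts on its own (VCD-style contrast).
    contrast = (1.0 + alpha) * logits_image - alpha * logits_description

    # Keep only tokens whose image-conditioned probability is at least k times
    # the probability of the most likely token; everything else is masked out.
    probs_image = F.softmax(logits_image, dim=-1)
    cutoff = k * probs_image.max(dim=-1, keepdim=True).values
    contrast = contrast.masked_fill(probs_image < cutoff, float("-inf"))

    # Sample the next token from the contrasted, constrained distribution.
    return torch.multinomial(F.softmax(contrast, dim=-1), num_samples=1)
```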
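The Experiment Setup row lists the decoding defaults in prose; the dictionary below simply collects them in one place for quick comparison. The key and field names are illustrative assumptions and do not claim to match the released repository's configuration schema.

```python
# Quoted decoding defaults, restated as a plain dictionary (names are illustrative).
DECODING_DEFAULTS = {
    "nucleus_sampling": {"top_p": 0.95, "temperature": 1.0},
    "beam_search": {"num_beams": 5},
    "opera": {"num_beams": 5},          # same search window of 5 as beam search
    "vcd": {"cd_alpha": 1.0, "cd_beta": 0.1},
    "code": {"k": 0.3},                 # the paper's method (CODE)
}
```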