Mitigating Object Hallucination via Concentric Causal Attention
Authors: Yun Xing, Yiheng Li, Ivan Laptev, Shijian Lu
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We carry out pre-training and instruction tuning as [46] and verify our trained model on multiple object hallucination benchmarks [41, 57, 20] (+4.24% on Accuracy and +2.73% on F1 score, as compared to the state-of-the-art method [34] on POPE). Without bells and whistles, our positional alignment method surpasses existing hallucination mitigation strategies by large margins on multiple object hallucination benchmarks. |
| Researcher Affiliation | Academia | Yun Xing1 Yiheng Li1 Ivan Laptev2 Shijian Lu1 1 Nanyang Technological University 2 MBZUAI |
| Pseudocode | Yes | def compute_vis_inst_flow(attn, img_token_pos, img_token_len): (a hedged runnable sketch of this function follows the table) |
| Open Source Code | Yes | https://github.com/xing0047/cca-llava.git ... We refer readers to https://github.com/xing0047/cca-llava for details of data and code guideline. |
| Open Datasets | Yes | Following [46, 45], we adopt pre-trained CLIP ViT-L/14 [55] ... Training consists of two stages, including 1) a pre-training over CC-558K dataset [46] with global batch size of 256 and 2) an instruction tuning with a 665k multi-turn conversation dataset [45] with global batch size of 128. |
| Dataset Splits | Yes | Training consists of two stages, including 1) a pre-training over CC-558K dataset [46] with global batch size of 256 and 2) an instruction tuning with a 665k multi-turn conversation dataset [45] with global batch size of 128. ... We sample 3,000 annotations from COCO VAL 2014 [42] to carry out our motivation experiments. |
| Hardware Specification | Yes | We use 4 NVIDIA RTX A6000s to train our models. |
| Software Dependencies | No | The paper mentions using specific models like CLIP ViT-L/14 and Vicuna-7B, but does not provide specific version numbers for underlying software libraries or programming languages used in the experimental setup. |
| Experiment Setup | Yes | Following [46, 45], we adopt pre-trained CLIP ViT-L/14 [55] with 336x336 resolution as visual encoder and Vicuna-7B [12] as LLM, and a 2-layer MLP that connects the visual encoder and LLM. Training consists of two stages, including 1) a pre-training over CC-558K dataset [46] with global batch size of 256 and 2) an instruction tuning with a 665k multi-turn conversation dataset [45] with global batch size of 128. (a minimal projector sketch also follows the table) |
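
The Pseudocode row quotes only the signature of `compute_vis_inst_flow`. The code below is a hedged reconstruction, not the authors' implementation: it assumes `attn` is a per-head attention map of shape `[num_heads, seq_len, seq_len]`, that the image tokens occupy positions `img_token_pos : img_token_pos + img_token_len`, and that "visual-to-instruction flow" means the attention that later instruction tokens place on each image-token position.

```python
import torch

def compute_vis_inst_flow(attn, img_token_pos, img_token_len):
    """Hedged sketch of visual-to-instruction attention flow.

    Assumed inputs (not confirmed by the paper):
      attn          -- [num_heads, seq_len, seq_len] attention weights,
                       rows = query positions, cols = key positions
      img_token_pos -- index of the first image token in the sequence
      img_token_len -- number of image tokens
    Returns a [img_token_len] tensor: the mean attention each image-token
    position receives from the instruction tokens that follow it.
    """
    inst_start = img_token_pos + img_token_len  # instruction tokens come after the image tokens
    attn_mean = attn.mean(dim=0)                # average over heads -> [seq_len, seq_len]
    # attention from every instruction query position to every image key position
    inst_to_img = attn_mean[inst_start:, img_token_pos:img_token_pos + img_token_len]
    return inst_to_img.mean(dim=0)              # per-image-token-position flow
```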
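
For the Experiment Setup row, the connector between CLIP ViT-L/14 and Vicuna-7B is described only as a 2-layer MLP. The sketch below is a minimal LLaVA-style projector; the hidden sizes (1024 for CLIP ViT-L/14 patch features, 4096 for Vicuna-7B embeddings) and the GELU activation are standard choices assumed here, not values extracted from the paper.

```python
import torch.nn as nn

# Assumed hidden sizes: CLIP ViT-L/14 patch features (1024) -> Vicuna-7B embeddings (4096).
clip_hidden, llm_hidden = 1024, 4096

# 2-layer MLP projector mapping visual patch features into the LLM embedding space.
projector = nn.Sequential(
    nn.Linear(clip_hidden, llm_hidden),
    nn.GELU(),
    nn.Linear(llm_hidden, llm_hidden),
)
```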