Mitigating Object Hallucination via Concentric Causal Attention

Authors: Yun Xing, Yiheng Li, Ivan Laptev, Shijian Lu

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We carry out pre-training and instruction tuning as [46] and verify our trained model on multiple object hallucination benchmarks [41, 57, 20] (+4.24% on Accuracy and +2.73% on F1 score, as compared to the state-of-the-art method [34] on POPE). Without bells and whistles, our positional alignment method surpasses existing hallucination mitigation strategies by large margins on multiple object hallucination benchmarks.
Researcher Affiliation | Academia | Yun Xing (1), Yiheng Li (1), Ivan Laptev (2), Shijian Lu (1); (1) Nanyang Technological University, (2) MBZUAI
Pseudocode | Yes | def compute_vis_inst_flow(attn, img_token_pos, img_token_len): (a hypothetical expansion of this helper is sketched after the table)
Open Source Code | Yes | https://github.com/xing0047/cca-llava.git ... We refer readers to https://github.com/xing0047/cca-llava for details of the data and code guidelines.
Open Datasets | Yes | Following [46, 45], we adopt pre-trained CLIP ViT-L/14 [55] ... Training consists of two stages: 1) pre-training on the CC-558K dataset [46] with a global batch size of 256, and 2) instruction tuning on a 665K multi-turn conversation dataset [45] with a global batch size of 128.
Dataset Splits | Yes | Training consists of two stages: 1) pre-training on the CC-558K dataset [46] with a global batch size of 256, and 2) instruction tuning on a 665K multi-turn conversation dataset [45] with a global batch size of 128. ... We sample 3,000 annotations from COCO val 2014 [42] to carry out our motivation experiments.
Hardware Specification | Yes | We use 4 NVIDIA RTX A6000s to train our models.
Software Dependencies | No | The paper mentions specific models such as CLIP ViT-L/14 and Vicuna-7B, but does not provide version numbers for the underlying software libraries or programming languages used in the experimental setup.
Experiment Setup | Yes | Following [46, 45], we adopt pre-trained CLIP ViT-L/14 [55] at 336x336 resolution as the visual encoder, Vicuna-7B [12] as the LLM, and a 2-layer MLP that connects the visual encoder and the LLM. Training consists of two stages: 1) pre-training on the CC-558K dataset [46] with a global batch size of 256, and 2) instruction tuning on a 665K multi-turn conversation dataset [45] with a global batch size of 128. (A hedged sketch of the MLP connector also appears after the table.)
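
The Pseudocode row cites a helper named compute_vis_inst_flow(attn, img_token_pos, img_token_len). The snippet below is a minimal, hypothetical sketch of what such a helper could compute, not the repository's actual code: it assumes attn is a causal attention map of shape [num_heads, seq_len, seq_len] for one decoder layer and that the visual tokens occupy a contiguous span starting at img_token_pos.

    # Hypothetical sketch only; the real helper in cca-llava may differ.
    import torch

    def compute_vis_inst_flow(attn: torch.Tensor,
                              img_token_pos: int,
                              img_token_len: int) -> torch.Tensor:
        """Average attention mass flowing from visual tokens to later instruction tokens."""
        img_end = img_token_pos + img_token_len
        # Queries after the image span (instruction/response tokens) attending to
        # visual keys; averaging over heads and queries gives per-visual-token flow.
        inst_to_vis = attn[:, img_end:, img_token_pos:img_end]  # [heads, inst_len, img_len]
        return inst_to_vis.mean(dim=(0, 1))                     # [img_len]

    # Usage with random weights (shapes are illustrative only):
    # attn = torch.softmax(torch.randn(32, 640, 640), dim=-1)
    # flow = compute_vis_inst_flow(attn, img_token_pos=35, img_token_len=576)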
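
For the Experiment Setup row, the following is a minimal sketch of the 2-layer MLP connector described there, which maps CLIP ViT-L/14 features into the Vicuna-7B embedding space. The hidden sizes follow the public CLIP-L/14 (1024) and Vicuna-7B (4096) configurations; the class name, activation choice, and exact layer layout are illustrative assumptions, not the released module.

    # Sketch of a 2-layer MLP vision-to-LLM projector under the stated assumptions.
    import torch
    import torch.nn as nn

    class VisionProjector(nn.Module):
        def __init__(self, vis_dim: int = 1024, llm_dim: int = 4096):
            super().__init__()
            self.mlp = nn.Sequential(
                nn.Linear(vis_dim, llm_dim),
                nn.GELU(),
                nn.Linear(llm_dim, llm_dim),
            )

        def forward(self, vis_feats: torch.Tensor) -> torch.Tensor:
            # vis_feats: [batch, num_patches, vis_dim] from the CLIP visual encoder
            return self.mlp(vis_feats)  # [batch, num_patches, llm_dim]

    # A 336x336 image with 14x14 patches yields 24x24 = 576 visual tokens:
    # projector = VisionProjector()
    # tokens = projector(torch.randn(1, 576, 1024))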