Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Mitigating Object Hallucination via Concentric Causal Attention

Authors: Yun Xing, Yiheng Li, Ivan Laptev, Shijian Lu

NeurIPS 2024 | Venue PDF | LLM Run Details

Reproducibility Variables

Research Type: Experimental
LLM Response: We carry out pre-training and instruction tuning as [46] and verify our trained model on multiple object hallucination benchmarks [41, 57, 20] (+4.24% on Accuracy and +2.73% on F1 score, compared to the state-of-the-art method [34] on POPE). Without bells and whistles, our positional alignment method surpasses existing hallucination mitigation strategies by large margins on multiple object hallucination benchmarks.

Researcher Affiliation: Academia
LLM Response: Yun Xing¹, Yiheng Li¹, Ivan Laptev², Shijian Lu¹ (¹Nanyang Technological University, ²MBZUAI)

Pseudocode: Yes
LLM Response: def compute_vis_inst_flow(attn, img_token_pos, img_token_len):

Open Source Code: Yes
LLM Response: https://github.com/xing0047/cca-llava.git ... We refer readers to https://github.com/xing0047/cca-llava for details of the data and code guidelines.

Open Datasets: Yes
LLM Response: Following [46, 45], we adopt pre-trained CLIP ViT-L/14 [55] ... Training consists of two stages: 1) pre-training on the CC-558K dataset [46] with a global batch size of 256, and 2) instruction tuning on a 665K multi-turn conversation dataset [45] with a global batch size of 128.

Dataset Splits: Yes
LLM Response: Training consists of two stages: 1) pre-training on the CC-558K dataset [46] with a global batch size of 256, and 2) instruction tuning on a 665K multi-turn conversation dataset [45] with a global batch size of 128. ... We sample 3,000 annotations from COCO VAL 2014 [42] to carry out our motivation experiments.

Hardware Specification: Yes
LLM Response: We use 4 NVIDIA RTX A6000s to train our models.

Software Dependencies: No
LLM Response: The paper mentions specific models such as CLIP ViT-L/14 and Vicuna-7B, but does not provide version numbers for the underlying software libraries or programming languages used in the experimental setup.

Experiment Setup: Yes
LLM Response: Following [46, 45], we adopt pre-trained CLIP ViT-L/14 [55] at 336x336 resolution as the visual encoder, Vicuna-7B [12] as the LLM, and a 2-layer MLP that connects the visual encoder and the LLM. Training consists of two stages: 1) pre-training on the CC-558K dataset [46] with a global batch size of 256, and 2) instruction tuning on a 665K multi-turn conversation dataset [45] with a global batch size of 128.
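The Pseudocode entry above quotes only the signature of compute_vis_inst_flow; the paper's actual function body is not reproduced here. As a purely illustrative sketch, one plausible reading of the name (not confirmed by the paper) is that it measures attention mass flowing from instruction tokens to image tokens, assuming attn is a row-stochastic (seq_len, seq_len) attention matrix and the image tokens occupy a contiguous span:

```python
import numpy as np

def compute_vis_inst_flow(attn, img_token_pos, img_token_len):
    """Hypothetical sketch: mean attention mass from instruction tokens
    (positioned after the image span) to the image tokens.

    attn: (seq_len, seq_len) attention matrix; each row sums to 1.
    img_token_pos: index of the first image token.
    img_token_len: number of image tokens.
    """
    img_end = img_token_pos + img_token_len
    # Rows are query (instruction) tokens, columns are key (image) tokens.
    inst_to_img = attn[img_end:, img_token_pos:img_end]
    # Per-instruction-token attention mass on the image span, averaged.
    return inst_to_img.sum(axis=1).mean()
```

For example, with a uniform 4x4 attention matrix and a two-token image span starting at position 1, each instruction row places 0.5 of its attention mass on the image tokens. This is a sketch of one possible definition only; consult the linked repository for the actual implementation.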