Mitigating Object Hallucination via Concentric Causal Attention

Authors: Yun Xing, Yiheng Li, Ivan Laptev, Shijian Lu

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We carry out pre-training and instruction tuning as [46] and verify our trained model on multiple object hallucination benchmarks [41, 57, 20] (+4.24% on Accuracy and +2.73% on F1 score, as compared to the state-of-the-art method [34] on POPE). Without bells and whistles, our positional alignment method surpasses existing hallucination mitigation strategies by large margins on multiple object hallucination benchmarks.
Researcher Affiliation | Academia | Yun Xing (1), Yiheng Li (1), Ivan Laptev (2), Shijian Lu (1); (1) Nanyang Technological University, (2) MBZUAI
Pseudocode | Yes | def compute_vis_inst_flow(attn, img_token_pos, img_token_len): (a hypothetical expansion of this helper is sketched after the table)
Open Source Code | Yes | https://github.com/xing0047/cca-llava.git ... We refer readers to https://github.com/xing0047/cca-llava for details of the data and code guidelines.
Open Datasets | Yes | Following [46, 45], we adopt pre-trained CLIP ViT-L/14 [55] ... Training consists of two stages: 1) pre-training on the CC-558K dataset [46] with a global batch size of 256, and 2) instruction tuning on a 665K multi-turn conversation dataset [45] with a global batch size of 128.
Dataset Splits | Yes | Training consists of two stages: 1) pre-training on the CC-558K dataset [46] with a global batch size of 256, and 2) instruction tuning on a 665K multi-turn conversation dataset [45] with a global batch size of 128. ... We sample 3,000 annotations from COCO val 2014 [42] to carry out our motivation experiments.
Hardware Specification | Yes | We use 4 NVIDIA RTX A6000s to train our models.
Software Dependencies | No | The paper mentions specific models such as CLIP ViT-L/14 and Vicuna-7B, but does not provide version numbers for the underlying software libraries or programming languages used in the experimental setup.
Experiment Setup | Yes | Following [46, 45], we adopt pre-trained CLIP ViT-L/14 [55] at 336x336 resolution as the visual encoder, Vicuna-7B [12] as the LLM, and a 2-layer MLP that connects the visual encoder and the LLM. Training consists of two stages: 1) pre-training on the CC-558K dataset [46] with a global batch size of 256, and 2) instruction tuning on a 665K multi-turn conversation dataset [45] with a global batch size of 128. (A hedged sketch of the MLP connector also appears after the table.)
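
The Pseudocode row cites a helper named compute_vis_inst_flow(attn, img_token_pos, img_token_len). The snippet below is a minimal, hypothetical sketch of what such a helper could compute, not the repository's actual code: it assumes attn is a causal attention map of shape [num_heads, seq_len, seq_len] for one decoder layer and that the visual tokens occupy a contiguous span starting at img_token_pos.

    # Hypothetical sketch only; the real helper in cca-llava may differ.
    import torch

    def compute_vis_inst_flow(attn: torch.Tensor,
                              img_token_pos: int,
                              img_token_len: int) -> torch.Tensor:
        """Average attention mass flowing from visual tokens to later instruction tokens."""
        img_end = img_token_pos + img_token_len
        # Queries after the image span (instruction/response tokens) attending to
        # visual keys; averaging over heads and queries gives per-visual-token flow.
        inst_to_vis = attn[:, img_end:, img_token_pos:img_end]  # [heads, inst_len, img_len]
        return inst_to_vis.mean(dim=(0, 1))                     # [img_len]

    # Usage with random weights (shapes are illustrative only):
    # attn = torch.softmax(torch.randn(32, 640, 640), dim=-1)
    # flow = compute_vis_inst_flow(attn, img_token_pos=35, img_token_len=576)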
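
For the Experiment Setup row, the following is a minimal sketch of the 2-layer MLP connector described there, which maps CLIP ViT-L/14 features into the Vicuna-7B embedding space. The hidden sizes follow the public CLIP-L/14 (1024) and Vicuna-7B (4096) configurations; the class name, activation choice, and exact layer layout are illustrative assumptions, not the released module.

    # Sketch of a 2-layer MLP vision-to-LLM projector under the stated assumptions.
    import torch
    import torch.nn as nn

    class VisionProjector(nn.Module):
        def __init__(self, vis_dim: int = 1024, llm_dim: int = 4096):
            super().__init__()
            self.mlp = nn.Sequential(
                nn.Linear(vis_dim, llm_dim),
                nn.GELU(),
                nn.Linear(llm_dim, llm_dim),
            )

        def forward(self, vis_feats: torch.Tensor) -> torch.Tensor:
            # vis_feats: [batch, num_patches, vis_dim] from the CLIP visual encoder
            return self.mlp(vis_feats)  # [batch, num_patches, llm_dim]

    # A 336x336 image with 14x14 patches yields 24x24 = 576 visual tokens:
    # projector = VisionProjector()
    # tokens = projector(torch.randn(1, 576, 1024))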