Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026.
Mitigating Object Hallucination via Concentric Causal Attention
Authors: Yun Xing, Yiheng Li, Ivan Laptev, Shijian Lu
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We carry out pre-training and instruction tuning as [46] and verify our trained model on multiple object hallucination benchmarks [41, 57, 20] (+4.24% on Accuracy and +2.73% on F1 score, as compared to the state-of-the-art method [34] on POPE). Without bells and whistles, our positional alignment method surpasses existing hallucination mitigation strategies by large margins on multiple object hallucination benchmarks. |
| Researcher Affiliation | Academia | Yun Xing¹, Yiheng Li¹, Ivan Laptev², Shijian Lu¹; ¹Nanyang Technological University, ²MBZUAI |
| Pseudocode | Yes | `def compute_vis_inst_flow(attn, img_token_pos, img_token_len):` |
| Open Source Code | Yes | https://github.com/xing0047/cca-llava.git ... We refer readers to https://github.com/xing0047/cca-llava for details of data and code guideline. |
| Open Datasets | Yes | Following [46, 45], we adopt pre-trained CLIP ViT-L/14 [55] ... Training consists of two stages, including 1) a pre-training over CC-558K dataset [46] with global batch size of 256 and 2) a instruction tuning with a 665k multi-turn conversation dataset [45] with global batch size of 128. |
| Dataset Splits | Yes | Training consists of two stages, including 1) a pre-training over CC-558K dataset [46] with global batch size of 256 and 2) a instruction tuning with a 665k multi-turn conversation dataset [45] with global batch size of 128. ... We sample 3,000 annotations from COCO VAL 2014 [42] to carry out our motivation experiments. |
| Hardware Specification | Yes | We use 4 NVIDIA RTX A6000s to train our models. |
| Software Dependencies | No | The paper mentions using specific models like CLIP ViT-L/14 and Vicuna-7B, but does not provide specific version numbers for underlying software libraries or programming languages used in the experimental setup. |
| Experiment Setup | Yes | Following [46, 45], we adopt pre-trained CLIP ViT-L/14 [55] with 336x336 resolutions as visual encoder and Vicuna-7B [12] as LLM, and a 2-layer MLP that connects the visual encoder and LLM. Training consists of two stages, including 1) a pre-training over CC-558K dataset [46] with global batch size of 256 and 2) a instruction tuning with a 665k multi-turn conversation dataset [45] with global batch size of 128. |
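
The two-stage setup and hardware reported in the table can be written down as a small configuration sketch. This is an illustration only, not the authors' code: the `Stage` class, the field names, and the helper `samples_per_gpu_per_step` are hypothetical, and the calculation assumes all 4 GPUs are used in plain data parallelism, which the quoted text does not state. The table does not report how the per-GPU load is split between micro-batch size and gradient accumulation, so only their product is computed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    """One training stage as quoted in the table (field names are hypothetical)."""
    name: str
    dataset: str
    global_batch_size: int


# From the table: "We use 4 NVIDIA RTX A6000s to train our models."
# Assumption: all 4 GPUs take part in data-parallel training.
NUM_GPUS = 4

STAGES = [
    Stage("pretrain", "CC-558K", 256),
    Stage("instruction_tuning", "665k multi-turn conversation dataset", 128),
]


def samples_per_gpu_per_step(stage: Stage, num_gpus: int = NUM_GPUS) -> int:
    """Samples each GPU handles per optimizer step.

    This equals micro-batch size times gradient-accumulation steps;
    the split between the two is not given in the table.
    """
    if stage.global_batch_size % num_gpus != 0:
        raise ValueError("global batch size must be divisible by the GPU count")
    return stage.global_batch_size // num_gpus


if __name__ == "__main__":
    for stage in STAGES:
        per_gpu = samples_per_gpu_per_step(stage)
        print(f"{stage.name}: {stage.dataset}, "
              f"global batch {stage.global_batch_size}, {per_gpu} samples per GPU per step")
```

Under these assumptions, each GPU would process 64 samples per optimizer step during pre-training and 32 during instruction tuning. For the actual launch settings, the repository linked in the Open Source Code row is the authoritative source.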