Multi-Object Hallucination in Vision Language Models

Authors: Xuweiyi Chen, Ziqiao Ma, Xuejun Zhang, Sihan Xu, Shengyi Qian, Jianing Yang, David Fouhey, Joyce Chai

NeurIPS 2024 | Conference PDF | Archive PDF

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We introduce Recognition-based Object Probing Evaluation (ROPE), an automated evaluation protocol that considers the distribution of object classes within a single image during testing and uses visual referring prompts to eliminate ambiguity. With comprehensive empirical studies and analysis of potential factors leading to multi-object hallucination, we found that...
Researcher Affiliation | Academia | 1 University of Michigan, 2 University of Virginia, 3 New York University
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code | No | We will release both the data and code to the public soon. For code, we plan to release both the evaluation template and the code that we have used for data curation.
Open Datasets | Yes | We build our dataset upon existing panoptic segmentation datasets, including MSCOCO-Panoptic [28, 5] and ADE20K [74]... Available at https://huggingface.co/datasets/sled-umich/ROPE (see the loading sketch after the table).
Dataset Splits | Yes | To evaluate whether multi-object hallucination can be observed in both seen and unseen images, and to critically determine whether training on these images helps reduce hallucinations, we explicitly split our dataset into Seen and Unseen based on the original splits of the source datasets.
Hardware Specification | Yes | Our experiments were conducted on eight A40 and four A100 GPUs for slightly over a week.
Software Dependencies | No | The paper mentions various LVLMs and base LLMs with some versions, but does not specify software dependencies such as programming languages, machine learning frameworks, or specific libraries with version numbers (e.g., Python, PyTorch, or scikit-learn versions).
Experiment Setup | Yes | ROPE tasks LVLMs with selecting the best-matching class for multiple objects, as referred to by the visual prompts, from a predefined set of object classes. Specifically, each sample in the ROPE protocol consists of a quadruple {I, L, p1, ..., pn, o1, ..., on}: (1) an image I containing at least n objects; (2) a natural language instruction L that specifies the recognition task, including N candidate object classes c1, ..., cN; (3) n visual prompts p1, ..., pn, each of which queries an object in the image; and (4) n object classes o1, ..., on as the answers. In this work, we construct a dataset with N = 50 and n = 5... For other LVLMs, we overlay the visual prompts on the images using a red bounding box with a width of 2 and visual text specifying the object index, rendered in a white italic font on a black background with an alpha value of 0.75 for contrast and visibility (see the sample-structure and overlay sketches after the table).
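
The Open Datasets row points to the released data on the Hugging Face Hub. Below is a minimal loading sketch, assuming the repository is readable through the standard `datasets` API; the split layout and field names are not documented in this report, so the code inspects them rather than assuming them.

```python
# Minimal sketch: loading the released ROPE data from the Hugging Face Hub.
# The repository id comes from the paper; the split layout and field names
# are not documented here, so we only inspect them rather than assume them.
from datasets import load_dataset

rope = load_dataset("sled-umich/ROPE")
print(rope)  # lists the available splits (e.g., the Seen/Unseen partition)

first_split = next(iter(rope))  # name of the first split
example = rope[first_split][0]  # first example of that split
print(example.keys())           # inspect the actual field names
```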
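
The Experiment Setup row defines each ROPE sample as a quadruple {I, L, p1, ..., pn, o1, ..., on}. A sketch of that record as a Python dataclass follows; the class and field names are hypothetical, and representing each visual prompt as a bounding box is an assumption (the paper does not prescribe a concrete schema).

```python
# Sketch of one ROPE sample as a record, mirroring the quadruple
# {I, L, p1, ..., pn, o1, ..., on} with N = 50 candidate classes and
# n = 5 queried objects. Class and field names are hypothetical.
from dataclasses import dataclass

from PIL import Image


@dataclass
class RopeSample:
    image: Image.Image  # I: an image containing at least n objects
    instruction: str    # L: task instruction listing the N candidate classes
    prompts: list[tuple[int, int, int, int]]  # p1..pn: one visual prompt (box) per object
    answers: list[str]  # o1..on: ground-truth object classes
```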
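
The same row specifies how visual prompts are overlaid for LVLMs that lack native referring support: a red bounding box with a width of 2, plus an object-index label in white italic text on a black background with alpha 0.75. Below is a sketch under those parameters; the function name, box coordinate format, font file, and label size are assumptions for illustration.

```python
# Sketch of the visual-prompt overlay described in the Experiment Setup row:
# a red bounding box of width 2 and an object-index label in white italic
# text on a black background with alpha 0.75. Box coordinates (x0, y0, x1, y1),
# the font path, and the 16 px label size are assumptions.
from PIL import Image, ImageDraw, ImageFont


def overlay_visual_prompts(
    image: Image.Image, boxes: list[tuple[int, int, int, int]]
) -> Image.Image:
    """Draw an indexed visual prompt over each object bounding box."""
    base = image.convert("RGBA")
    layer = Image.new("RGBA", base.size, (0, 0, 0, 0))
    draw = ImageDraw.Draw(layer)
    # An italic font is assumed to be available at this path.
    font = ImageFont.truetype("DejaVuSans-Oblique.ttf", 16)
    for idx, (x0, y0, x1, y1) in enumerate(boxes, start=1):
        # Red bounding box with a line width of 2, as in the paper.
        draw.rectangle((x0, y0, x1, y1), outline=(255, 0, 0, 255), width=2)
        label = str(idx)
        tx0, ty0, tx1, ty1 = draw.textbbox((x0, y0), label, font=font)
        # Black label background with alpha 0.75 (~191 on the 0-255 scale).
        draw.rectangle((tx0 - 2, ty0 - 2, tx1 + 2, ty1 + 2), fill=(0, 0, 0, 191))
        draw.text((x0, y0), label, font=font, fill=(255, 255, 255, 255))
    # Composite the semi-transparent label layer onto the image.
    return Image.alpha_composite(base, layer).convert("RGB")
```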