Multi-Object Hallucination in Vision Language Models
Authors: Xuweiyi Chen, Ziqiao Ma, Xuejun Zhang, Sihan Xu, Shengyi Qian, Jianing Yang, David Fouhey, Joyce Chai
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We introduce Recognition-based Object Probing Evaluation (ROPE), an automated evaluation protocol that considers the distribution of object classes within a single image during testing and uses visual referring prompts to eliminate ambiguity. With comprehensive empirical studies and analysis of potential factors leading to multi-object hallucination, we found that... |
| Researcher Affiliation | Academia | 1University of Michigan 2University of Virginia 3New York University |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | No | We will release both the data and code to the public soon. For code, we plan to release both the evaluation template and the code that we have used for data curation. |
| Open Datasets | Yes | We build our dataset upon existing panoptic segmentation datasets, including MSCOCO-Panoptic [28, 5] and ADE20K [74]... Available at https://huggingface.co/datasets/sled-umich/ROPE (a loading sketch follows the table). |
| Dataset Splits | Yes | To evaluate whether multi-object hallucination can be observed in both seen and unseen images, and to critically determine if training on these images helps reduce hallucinations, we explicitly split our dataset into Seen and Unseen based on the original split of the datasets. (A partition sketch follows the table.) |
| Hardware Specification | Yes | Our experiments were conducted on eight A40 and four A100 GPUs for slightly over a week. |
| Software Dependencies | No | The paper mentions various LVLMs and base LLMs with some versions, but does not specify software dependencies like programming languages, machine learning frameworks, or specific libraries with version numbers (e.g., Python, PyTorch, scikit-learn versions). |
| Experiment Setup | Yes | ROPE tasks LVLMs with selecting the best-matching class for multiple objects, as referred to by the visual prompt, from a predefined set of object classes. Specifically, each sample in the ROPE protocol consists of a quadruple {I, L, ⟨p_1, ..., p_n⟩, ⟨o_1, ..., o_n⟩}: (1) an image I consisting of at least n objects; (2) a natural language instruction L that specifies the recognition task, including N candidate object classes c_1, ..., c_N; (3) n visual prompts p_1, ..., p_n, each querying an object in the image; and (4) n object classes o_1, ..., o_n as the answers. In this work, we construct a dataset with N = 50 and n = 5... For other LVLMs, we overlay the visual prompts on the images using a red bounding box with a width of 2 and visual text specifying the object index, presented in a white italic font on a black background with an alpha value of 0.75 for contrast and visibility. (An overlay sketch follows the table.) |
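
As noted in the Open Datasets row, the curated data is hosted on the Hugging Face Hub. Below is a minimal loading sketch using the `datasets` library; the split and field names are not confirmed by the paper, so the code simply prints whatever schema the repository actually exposes.

```python
# Minimal sketch: load the ROPE dataset from the Hugging Face Hub.
# Requires the `datasets` library (pip install datasets).
from datasets import load_dataset

dataset = load_dataset("sled-umich/ROPE")  # fetches all available splits
print(dataset)  # lists split names and per-split features

# Inspect one example to discover the actual field names.
first_split = next(iter(dataset.values()))
print(first_split[0])
```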
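
The Dataset Splits row describes partitioning images into Seen and Unseen according to the source datasets' original train/validation splits. The sketch below illustrates that logic under the assumption of a hypothetical `source_split` field recording each image's origin; the released data may encode this differently.

```python
# Sketch of the Seen/Unseen partition described in the paper: an image is
# "Seen" if it came from a source dataset's original training split (so
# LVLMs may have trained on it) and "Unseen" otherwise.
# The `source_split` field is hypothetical, used only for illustration.
def partition_seen_unseen(samples: list[dict]) -> tuple[list[dict], list[dict]]:
    seen = [s for s in samples if s["source_split"] == "train"]
    unseen = [s for s in samples if s["source_split"] != "train"]
    return seen, unseen
```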
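
The Experiment Setup row specifies how visual prompts are overlaid for LVLMs without native visual-referring support: a red bounding box of width 2 plus an object-index label in white italic text on a black background with alpha 0.75. A sketch of that overlay using Pillow follows; the font file, font size, and label placement are assumptions for illustration.

```python
# Sketch of the visual-prompt overlay described in the paper: a red
# bounding box (width 2) and an object-index label in white italic text
# on a semi-transparent black background (alpha 0.75).
# Font path, font size, and label placement are illustrative assumptions.
from PIL import Image, ImageDraw, ImageFont

def overlay_visual_prompts(
    image: Image.Image,
    boxes: list[tuple[int, int, int, int]],
    font_path: str = "DejaVuSans-Oblique.ttf",  # oblique ~= italic
) -> Image.Image:
    base = image.convert("RGBA")
    overlay = Image.new("RGBA", base.size, (0, 0, 0, 0))
    draw = ImageDraw.Draw(overlay)
    font = ImageFont.truetype(font_path, size=16)
    for index, (x0, y0, x1, y1) in enumerate(boxes, start=1):
        # Red bounding box with a line width of 2.
        draw.rectangle((x0, y0, x1, y1), outline=(255, 0, 0, 255), width=2)
        label = str(index)
        # Label background: black with alpha 0.75 (191/255) for contrast.
        tx0, ty0, tx1, ty1 = draw.textbbox((x0, y0), label, font=font)
        draw.rectangle((tx0 - 2, ty0 - 2, tx1 + 2, ty1 + 2), fill=(0, 0, 0, 191))
        draw.text((x0, y0), label, fill=(255, 255, 255, 255), font=font)
    return Image.alpha_composite(base, overlay).convert("RGB")
```

The box coordinates would come from the source datasets' panoptic segmentation annotations; anchoring each label at the top-left corner of its box is an illustrative choice, not one specified by the paper.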