Detecting and Preventing Hallucinations in Large Vision Language Models
Authors: Anisha Gunjal, Jihan Yin, Erhan Bas
AAAI 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We perform human evaluation on both FDPO and rejection sampling, and find that they reduce hallucination rates in InstructBLIP by 41% and 55% respectively. |
| Researcher Affiliation | Collaboration | Anisha Gunjal*, Jihan Yin*, Erhan Bas Scale AI anishagunjal@utexas.edu, jihan.yin@berkeley.edu, erhan.bas@gehealthcare.com |
| Pseudocode | No | The paper does not contain any sections or figures explicitly labeled 'Pseudocode' or 'Algorithm', nor does it present structured steps formatted like code or an algorithm. |
| Open Source Code | No | The abstract states 'The dataset is available at https://github.com/hendryx-scale/mhal-detect.', which refers to the M-HalDetect dataset, not the open-source code for the methodologies (FDPO, RS) described in the paper. |
| Open Datasets | Yes | We introduce M-HalDetect, a Multimodal Hallucination Detection Dataset... The dataset is available at https://github.com/hendryx-scale/mhal-detect. and The dataset comprises of image-description pairs sampled from 4,000 images taken from the val2014 split of the Common Objects in Context (COCO) dataset (Lin et al. 2014). |
| Dataset Splits | Yes | The dataset is divided into a training set with 3,200 images and a development set with 800 images. This creates 16k image-prompt-response triplets, split between 12800 samples in the train split and 3200 samples in the val split. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU types, or memory specifications used for running its experiments. |
| Software Dependencies | No | The paper mentions software components like 'Vicuna' and 'Natural Language Toolkit', but it does not provide specific version numbers for these or any other software dependencies needed to replicate the experiments. |
| Experiment Setup | Yes | We sample four responses using nucleus sampling from InstructBLIP with a temperature value set to 1.0. We use β = 0.5 for all our FDPO experiments, and train for a maximum of 5 epochs with lr = 10⁻⁶, warmup ratio of 0.03, and a cosine scheduler. |
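The reported experiment setup can be collected into a minimal sketch. This is an illustrative Python config, not the authors' code: the paper gives only the hyperparameter values, so the dict keys, the `lr_at_step` helper, and the warmup-then-cosine schedule shape are assumptions about how the reported settings (β = 0.5, lr = 10⁻⁶, warmup ratio 0.03, cosine scheduler, 5 epochs, four nucleus-sampled responses at temperature 1.0) would typically be wired together.

```python
import math

# Hyperparameters as reported in the paper; structure is illustrative.
fdpo_config = {
    "beta": 0.5,                 # DPO KL-penalty coefficient
    "learning_rate": 1e-6,       # reported lr = 10^-6
    "warmup_ratio": 0.03,        # reported warmup ratio
    "lr_scheduler_type": "cosine",
    "max_epochs": 5,
    "num_sampled_responses": 4,  # nucleus sampling from InstructBLIP
    "temperature": 1.0,
}

def lr_at_step(step: int, total_steps: int, cfg: dict) -> float:
    """Learning rate under a linear-warmup + cosine-decay schedule.

    The combination of a warmup ratio and a cosine scheduler suggests
    this common shape; the exact schedule is not specified in the paper.
    """
    warmup = int(cfg["warmup_ratio"] * total_steps)
    if step < warmup:
        # Linear warmup from 0 to the peak learning rate.
        return cfg["learning_rate"] * step / max(warmup, 1)
    # Cosine decay from the peak learning rate down to 0.
    progress = (step - warmup) / max(total_steps - warmup, 1)
    return cfg["learning_rate"] * 0.5 * (1.0 + math.cos(math.pi * progress))
```

For example, with 1,000 total steps the warmup spans the first 30 steps (3%), the learning rate peaks at 10⁻⁶ at step 30, and decays to zero by the final step.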