Detecting and Preventing Hallucinations in Large Vision Language Models
Authors: Anisha Gunjal, Jihan Yin, Erhan Bas
AAAI 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We perform human evaluation on both FDPO and rejection sampling, and find that they reduce hallucination rates in InstructBLIP by 41% and 55% respectively. |
| Researcher Affiliation | Collaboration | Anisha Gunjal*, Jihan Yin*, Erhan Bas Scale AI anishagunjal@utexas.edu, jihan.yin@berkeley.edu, erhan.bas@gehealthcare.com |
| Pseudocode | No | The paper does not contain any sections or figures explicitly labeled 'Pseudocode' or 'Algorithm', nor does it present structured steps formatted like code or an algorithm. |
| Open Source Code | No | The abstract states 'The dataset is available at https://github.com/hendryx-scale/mhal-detect.', which refers to the M-HalDetect dataset, not the open-source code for the methodologies (FDPO, RS) described in the paper. |
| Open Datasets | Yes | We introduce M-HalDetect, a Multimodal Hallucination Detection Dataset... The dataset is available at https://github.com/hendryx-scale/mhal-detect. and The dataset comprises of image-description pairs sampled from 4,000 images taken from the val2014 split of the Common Objects in Context (COCO) dataset (Lin et al. 2014). |
| Dataset Splits | Yes | The dataset is divided into a training set with 3,200 images and a development set with 800 images. This creates 16k image-prompt-response triplets, split between 12800 samples in the train split and 3200 samples in the val split. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU types, or memory specifications used for running its experiments. |
| Software Dependencies | No | The paper mentions software components like 'Vicuna' and 'Natural Language Toolkit', but it does not provide specific version numbers for these or any other software dependencies needed to replicate the experiments. |
| Experiment Setup | Yes | We sample four responses using nucleus sampling from InstructBLIP with a temperature value set to 1.0. We use β = 0.5 for all our FDPO experiments, and train for a maximum of 5 epochs with lr = 10⁻⁶, warmup ratio of 0.03, and a cosine scheduler. |
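The reported experiment setup can be collected into a minimal sketch. This is an illustrative Python config, not the authors' code: the paper gives only the hyperparameter values, so the dict keys, the `lr_at_step` helper, and the warmup-then-cosine schedule shape are assumptions about how the reported settings (β = 0.5, lr = 10⁻⁶, warmup ratio 0.03, cosine scheduler, 5 epochs, four nucleus-sampled responses at temperature 1.0) would typically be wired together.

```python
import math

# Hyperparameters as reported in the paper; structure is illustrative.
fdpo_config = {
    "beta": 0.5,                 # DPO KL-penalty coefficient
    "learning_rate": 1e-6,       # reported lr = 10^-6
    "warmup_ratio": 0.03,        # reported warmup ratio
    "lr_scheduler_type": "cosine",
    "max_epochs": 5,
    "num_sampled_responses": 4,  # nucleus sampling from InstructBLIP
    "temperature": 1.0,
}

def lr_at_step(step: int, total_steps: int, cfg: dict) -> float:
    """Learning rate under a linear-warmup + cosine-decay schedule.

    The combination of a warmup ratio and a cosine scheduler suggests
    this common shape; the exact schedule is not specified in the paper.
    """
    warmup = int(cfg["warmup_ratio"] * total_steps)
    if step < warmup:
        # Linear warmup from 0 to the peak learning rate.
        return cfg["learning_rate"] * step / max(warmup, 1)
    # Cosine decay from the peak learning rate down to 0.
    progress = (step - warmup) / max(total_steps - warmup, 1)
    return cfg["learning_rate"] * 0.5 * (1.0 + math.cos(math.pi * progress))
```

For example, with 1,000 total steps the warmup spans the first 30 steps (3%), the learning rate peaks at 10⁻⁶ at step 30, and decays to zero by the final step.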