Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et alK. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026..

Sight Beyond Text: Multi-Modal Training Enhances LLMs in Truthfulness and Ethics

Authors: Haoqin Tu, Bingchen Zhao, Chen Wei, Cihang Xie

TMLR 2024 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	This paper aims to get out of the box and showcase an intriguing characteristic of multi-modal trained LLMs our preliminary results suggest that visual instruction tuning, a prevailing strategy to integrate vision knowledge into the LLMs, unexpectedly and interestingly helps models attain both improved truthfulness and ethical alignment in the pure NLP context. For example, a visual-instruction-tuned LLa MA2 7B model surpasses the performance of the LLa MA2-chat 7B model, fine-tuned with over one million human annotations, on Truthful QA and Ethics benchmarks. Similarly, the latest LLa MA3 series also shows consistent performance gains by 0.6% on average following visualinstruction tuning. Another example is that two versions of proprietary model GPT-4V-turbo, which incorporates visual information, surpasses its LLM-only counterpart GPT-4-turbo by around 1.6% on both aspects. Further analysis reveals that the improved alignment can be attributed to the superior instruction quality inherent to visual-text data.
Researcher Affiliation	Academia	Haoqin Tu1 , Bingchen Zhao2 , Chen Wei3, Cihang Xie1 1University of California, Santa Cruz 2University of Edinburgh 3Rice University
Pseudocode	No	The paper describes the model architecture and training procedure in text, but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code	Yes	In releasing our code at https://github.com/UCSC-VLAA/Sight-Beyond-Text, we aspire to foster further exploration into the intrinsic value of visual-text synergies, and in a broader scope, multi-modal interactions in alignment research.
Open Datasets	Yes	Our stance is informed by empirical evidence demonstrating the beneficial impact of diverse data sources on LLM capabilities. For example, the inclusion of code data has been shown to improve the reasoning ability of LLMs (Ma et al., 2024). [...] In our preliminary explorations, we tune LLa MA series models (Touvron et al., 2023a;b) with the visual instruction data from LLa VA (Liu et al., 2023b;a). The results of these experiments are intriguing: for a vanilla LLa MA2 7B model, visual instruction tuning can register impressive scores of 46.0% on Truthful QA-mc (+7.1%) (Lin et al., 2022) and 65.4% on Ethics (+19.6%) (Hendrycks et al., 2020), depending on the specific tuning approach. [...] Data-wise, we adhere to the protocols set by LLa VA (Liu et al., 2023b): the connector is initially trained using 595k image-text pairings filtered from CC3M (Changpinyo et al., 2021); the subsequent stage that requires LLM training utilizes 158k instructionsfollowing data from LLa VA with 80k unique images, which contains image-grounded conversation, image descriptions, and image-based complex reasoning tasks. To investigate the factors driving the improvements in visual instruction tuning, we also explore tuning the model using only text-based instruction data. We utilize three types of text-only data (sampled to equal sizes): visual instruction tuning data without images, Alpaca data (Taori et al., 2023), and Orca data (Lian et al., 2023). [...] For the Ethics benchmark, we use accuracy as the evaluation metric. For Truthful QA, we follow the official repository and use Rouge and/or BLEU accuracy for generation tasks, along with single-true (mc1) and multi-true (mc2) metrics for question-answering. [...] We hereby test the visual-instruction tuned models on recent multi-modal benchmarks, where five tasks are deployed: Unicorn benchmark (Tu et al., 2023a) dedicates evaluating the MLLM ability in safety scenarios, we take two OODCV-VQA tasks and Sketchy-VQA tasks for testing whether models can well handle OOD visual/text input and sketch images, respectively. MME (Fu et al., 2023) consists of two evaluation aspects, i.e., cognition (CS) and perception (PS) with total 14 VQA tasks;1 MSCOCO (Lin et al., 2014) and Flickr30k (Young et al., 2014) captioning tasks are commonly used benchmarks in the field of image caption generation. [...] POPE (Li et al., 2023c) is used to evaluate the level of object hallucinations in MLLMs, which consists of three versions of balanced yes/no VQA tasks considering objects in the given image. It is built upon MSCOCO-2017 dataset (Lin etal., 2014). Additionally, We also make use of the image corruptions proposed in Image Net-C (Hendrycks & Dietterich, 2019) to measure the performance of the MLLMs on corrupted images for MSCOCO task (denoted as MSCOCO-C).
Dataset Splits	Yes	Data-wise, we adhere to the protocols set by LLa VA (Liu et al., 2023b): the connector is initially trained using 595k image-text pairings filtered from CC3M (Changpinyo et al., 2021); the subsequent stage that requires LLM training utilizes 158k instructionsfollowing data from LLa VA with 80k unique images, which contains image-grounded conversation, image descriptions, and image-based complex reasoning tasks. To investigate the factors driving the improvements in visual instruction tuning, we also explore tuning the model using only text-based instruction data. We utilize three types of text-only data (sampled to equal sizes): visual instruction tuning data without images, Alpaca data (Taori et al., 2023), and Orca data (Lian et al., 2023). [...] For a fair comparison, we randomly sample 80K data from Alpaca and Orca data respectively for the training. [...] Specifically, we utilize data from LLa VA (Liu et al., 2023b), which categorizes visual instruction tuning data into three groups: Conversation, Details, and Reasoning. Each group comprises 20k data points, sampled from the original training splits. For a fair comparison, we also take a uniform sample of 20k from the full 80k visual instructions to form the baseline group. We tune LLa MA2 and LLa MA2-chat with each data group (of 20k data points) separately, and report the results in fig. 3.
Hardware Specification	No	The paper does not provide specific hardware details (e.g., GPU models, CPU types, memory amounts) used for running the experiments. It only generally acknowledges support for "computing and resource needs".
Software Dependencies	No	The paper mentions various models (e.g., LLaMA, CLIP ViT-L/14) and frameworks (e.g., LLaVA) but does not provide specific version numbers for any software dependencies like programming languages, libraries, or toolkits.
Experiment Setup	No	The paper states, "We strictly adhere to the setups in LLaVA (Liu et al., 2023b) for fine-tuning LLMs on visual instruction tuning data." While it describes the training procedure in two stages and the data used, it does not explicitly detail hyperparameters such as learning rate, batch size, or number of epochs within this paper. These details are deferred to the referenced work or are not specified.