Position: TrustLLM: Trustworthiness in Large Language Models

Authors: Yue Huang, Lichao Sun, Haoran Wang, Siyuan Wu, Qihui Zhang, Yuan Li, Chujie Gao, Yixin Huang, Wenhan Lyu, Yixuan Zhang, Xiner Li, Hanchi Sun, Zhengliang Liu, Yixin Liu, Yijue Wang, Zhikun Zhang, Bertie Vidgen, Bhavya Kailkhura, Caiming Xiong, Chaowei Xiao, Chunyuan Li, Eric P. Xing, Furong Huang, Hao Liu, Heng Ji, Hongyi Wang, Huan Zhang, Huaxiu Yao, Manolis Kellis, Marinka Zitnik, Meng Jiang, Mohit Bansal, James Zou, Jian Pei, Jian Liu, Jianfeng Gao, Jiawei Han, Jieyu Zhao, Jiliang Tang, Jindong Wang, Joaquin Vanschoren, John Mitchell, Kai Shu, Kaidi Xu, Kai-Wei Chang, Lifang He, Lifu Huang, Michael Backes, Neil Zhenqiang Gong, Philip S. Yu, Pin-Yu Chen, Quanquan Gu, Ran Xu, Rex Ying, Shuiwang Ji, Suman Jana, Tianlong Chen, Tianming Liu, Tianyi Zhou, William Yang Wang, Xiang Li, Xiangliang Zhang, Xiao Wang, Xing Xie, Xun Chen, Xuyu Wang, Yan Liu, Yanfang Ye, Yinzhi Cao, Yong Chen, Yue Zhao

ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental This paper introduces TRUSTLLM, a comprehensive study of trustworthiness in LLMs, including principles for different dimensions of trustworthiness, established benchmark, evaluation, and analysis of trustworthiness for mainstream LLMs, and discussion of open challenges and future directions. Specifically, we first propose a set of principles for trustworthy LLMs that span eight different dimensions. Based on these principles, we further establish a benchmark across six dimensions including truthfulness, safety, fairness, robustness, privacy, and machine ethics. We then present a study evaluating 16 mainstream LLMs in TRUSTLLM, consisting of over 30 datasets. Our findings firstly show that in general trustworthiness and capability (i.e., functional effectiveness) are positively related. Secondly, our observations reveal that proprietary LLMs generally outperform most open-source counterparts in terms of trustworthiness, raising concerns about the potential risks of widely accessible open-source LLMs. However, a few open-source LLMs come very close to proprietary ones, suggesting that open-source models can achieve high levels of trustworthiness without additional mechanisms like moderator, offering valuable insights for developers in this field. Thirdly, it is important to note that some LLMs may be overly calibrated towards exhibiting trustworthiness, to the extent that they compromise their utility by mistakenly treating benign prompts as harmful and consequently not responding. Besides these observations, we’ve uncovered key insights into the multifaceted trustworthiness in LLMs. We emphasize the importance of ensuring transparency not only in the models themselves but also in the technologies that underpin trustworthiness. We advocate that the establishment of an AI alliance between industry, academia, and the open-source community to foster collaboration is imperative to advance the trustworthiness of LLMs. Our dataset, code, and toolkit will be available at https://github.com/HowieHwong/TrustLLM and the leaderboard is released at https://trustllmbenchmark.github.io/TrustLLM-Website/. ... To facilitate the understanding of our study, in this section, we first present the observations and insights we have drawn based on our extensive empirical studies in this work.
Researcher Affiliation Collaboration Yue Huang 1 2 * Lichao Sun 1 * Haoran Wang 3 * Siyuan Wu 4 * Qihui Zhang 4 * Yuan Li 5 1 * Chujie Gao 4 * Yixin Huang 6 * Wenhan Lyu 7 * Yixuan Zhang 7 * Xiner Li 8 * Hanchi Sun 1 * Zhengliang Liu 9 * Yixin Liu 1 * Yijue Wang 10 * Zhikun Zhang 11 * Bertie Vidgen 12 13 Bhavya Kailkhura 14 Caiming Xiong 15 Chaowei Xiao 16 Chunyuan Li 17 Eric Xing 18 19 Furong Huang 20 Hao Liu 21 Heng Ji 22 Hongyi Wang 23 18 Huan Zhang 22 Huaxiu Yao 24 Manolis Kellis 25 Marinka Zitnik 26 Meng Jiang 2 Mohit Bansal 24 James Zou 11 Jian Pei 27 Jian Liu 28 Jianfeng Gao 17 Jiawei Han 22 Jieyu Zhao 29 Jiliang Tang 30 Jindong Wang 31 Joaquin Vanschoren 32 John C Mitchell 11 Kai Shu 3 Kaidi Xu 33 Kai-Wei Chang 34 Lifang He 1 Lifu Huang 35 Michael Backes 4 Neil Zhenqiang Gong 27 Philip S. Yu 36 Pin-Yu Chen 37 Quanquan Gu 34 Ran Xu 15 Rex Ying 38 Shuiwang Ji 8 Suman Jana 39 Tianlong Chen 24 Tianming Liu 9 Tianyi Zhou 20 William Wang 40 Xiang Li 41 Xiangliang Zhang 2 Xiao Wang 42 Xing Xie 31 Xun Chen 10 Xuyu Wang 43 Yan Liu 29 Yanfang Ye 2 Yinzhi Cao 44 Yong Chen 45 Yue Zhao 29 ... Affiliations: 1 Lehigh University; 2 University of Notre Dame; 3 Illinois Institute of Technology; 4 CISPA; 5 University of Cambridge; 6 Institut Polytechnique de Paris; 7 William & Mary; 8 Texas A&M University; 9 University of Georgia; 10 Samsung Research America; 11 Stanford University; 12 MLCommons; 13 University of Oxford; 14 Lawrence Livermore National Laboratory; 15 Salesforce Research; 16 University of Wisconsin, Madison; 17 Microsoft Research; 18 Carnegie Mellon University; 19 Mohamed Bin Zayed University of Artificial Intelligence; 20 University of Maryland; 21 University of California, Berkeley; 22 University of Illinois Urbana-Champaign; 23 Rutgers University; 24 UNC Chapel Hill; 25 Massachusetts Institute of Technology; 26 Harvard University; 27 Duke University; 28 University of Tennessee, Knoxville; 29 University of Southern California; 30 Michigan State University; 31 Microsoft Research Asia; 32 Eindhoven University of Technology; 33 Drexel University; 34 University of California, Los Angeles; 35 Virginia Tech; 36 University of Illinois Chicago; 37 IBM Research; 38 Yale University; 39 Columbia University; 40 University of California, Santa Barbara; 41 Massachusetts General Hospital; 42 Northwestern University; 43 Florida International University; 44 Johns Hopkins University; 45 University of Pennsylvania. Correspondence to: Yue Huang <yhuang37@nd.edu>, Lichao Sun <lis221@lehigh.edu>.
Pseudocode No The paper describes methods and evaluation processes but does not include any pseudocode or formal algorithm blocks.
Open Source Code Yes Our dataset, code, and toolkit will be available at https://github.com/HowieHwong/TrustLLM and the leaderboard is released at https://trustllmbenchmark.github.io/TrustLLM-Website/.
Open Datasets Yes Datasets. In the benchmark, we introduce a collection of 30 datasets that have been meticulously selected to ensure a comprehensive evaluation of the diverse capabilities of LLMs. Each dataset provides a unique set of challenges. They benchmark the LLMs across various dimensions of trustworthy tasks. A detailed description and the specifications of these datasets are provided in Table 4. ... Table 4. Datasets and metrics in the benchmark. One marker denotes datasets drawn from prior work; another denotes datasets first proposed in this benchmark. SQuAD2.0 (Rajpurkar et al., 2018) ... CODAH (Chen et al., 2019b) ... HotpotQA (Yang et al., 2018) ... AdvGLUE (Wang et al., 2021b) ... ETHICS (Hendrycks et al., 2020b). (A dataset-loading sketch follows the table.)
Dataset Splits No The paper mentions using a "dev set" for some evaluations, implying a validation or development split. For instance, Section G.1.1 states, "We use the dev set to evaluate LLMs, and the number of test samples in each task is shown in Table 25." However, it does not consistently specify how datasets are divided into training, validation, and test sets across experiments, especially for the newly constructed datasets, and it gives no exact percentages or counts for validation splits.
Hardware Specification No The paper mentions training large models and LLMs, and discusses the high costs involved, but it does not specify any particular hardware components (e.g., GPU models, CPU types, or memory specifications) used for running the experiments.
Software Dependencies No The paper mentions specific models and APIs used for evaluation (e.g., "Longformer classifier", "GPT-4/ChatGPT Eval", "OpenAI's text-embedding-ada-002"), but it does not provide version numbers for these components or for any other software dependencies that would be required for replication. (An embedding-similarity sketch follows the table.)
Experiment Setup Yes In this study, we meticulously curate a diverse set of 16 LLMs, encompassing proprietary and open-weight examples. This collection represents a broad spectrum of model size, training data, methodologies employed, and functional capabilities, offering a comprehensive landscape for evaluation. We summarize the information of each LLM in Table 3. ... We categorize the tasks in the benchmark into two main groups: Generation and Classification. Drawing from prior studies (Wang et al., 2023b), we employ a temperature setting of 0 for classification tasks to ensure more precise outputs. Conversely, for generation tasks, we set the temperature to 1, fostering a more diverse range of results and exploring potential worst-case scenarios. For instance, recent research suggests that elevating the temperature can enhance the success rate of jailbreaking (Huang et al., 2023f). For other settings like decoding methods, we use the default setting of each LLM. (A decoding-policy sketch follows the table.)
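Several of the datasets cited in the Open Datasets row are publicly hosted on the Hugging Face Hub. Below is a minimal loading sketch for two of them; the Hub identifiers squad_v2 and adv_glue and the validation splits are assumptions based on the public Hub listings, since the paper does not state how the datasets were obtained.

```python
# Hedged sketch: loading two of the public datasets listed in Table 4.
# The Hub identifiers "squad_v2" and "adv_glue" are assumptions from the
# public Hugging Face listings, not paths given in the paper.
from datasets import load_dataset

squad_v2 = load_dataset("squad_v2", split="validation")              # SQuAD2.0 (Rajpurkar et al., 2018)
adv_sst2 = load_dataset("adv_glue", "adv_sst2", split="validation")  # AdvGLUE (Wang et al., 2021b), SST-2 subset

print(len(squad_v2), squad_v2[0]["question"])
print(len(adv_sst2), adv_sst2[0]["sentence"])
```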
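The Software Dependencies row names OpenAI's text-embedding-ada-002 among the evaluation tools but gives no versions or implementation details. The sketch below assumes a standard embedding-similarity check between a model response and a reference answer, using the OpenAI Python SDK v1.x interface; the procedure is illustrative and not confirmed as the paper's actual pipeline.

```python
# Hedged sketch: embedding similarity with text-embedding-ada-002 as one
# plausible automated-evaluation step. The model name comes from the paper;
# the procedure and SDK version (openai>=1.0) are assumptions.
import numpy as np
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def embed(texts: list[str]) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-ada-002", input=texts)
    return np.array([d.embedding for d in resp.data])

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

response_vec, reference_vec = embed([
    "The LLM's answer to an evaluation prompt.",
    "The reference answer from the dataset.",
])
print(f"similarity = {cosine(response_vec, reference_vec):.3f}")
```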
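The Experiment Setup row quotes a simple decoding policy: temperature 0 for classification tasks, temperature 1 for generation tasks, and provider defaults otherwise. The sketch below applies that policy to one API-served model, assuming the OpenAI chat-completions interface; the "gpt-4" model name and the helper function are illustrative, since the benchmark evaluates 16 different LLMs (Table 3).

```python
# Minimal sketch of the quoted decoding policy: temperature 0 for
# classification, temperature 1 for generation, other decoding parameters
# left at the provider's defaults. The model name and helper are
# illustrative assumptions, not the paper's exact evaluation harness.
from openai import OpenAI

client = OpenAI()
TASK_TEMPERATURE = {"classification": 0.0, "generation": 1.0}

def query(prompt: str, task_type: str, model: str = "gpt-4") -> str:
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=TASK_TEMPERATURE[task_type],
    )
    return resp.choices[0].message.content

# Classification-style probe (precise output wanted):
label = query("Is the following statement a stereotype? Answer yes or no: ...", "classification")
# Generation-style probe (diverse output wanted, e.g., for jailbreak testing):
reply = query("Write a short response to this user request: ...", "generation")
```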