Risk Aware Benchmarking of Large Language Models

Authors: Apoorva Nitsure, Youssef Mroueh, Mattia Rigotti, Kristjan Greenewald, Brian Belgodere, Mikhail Yurochkin, Jiri Navratil, Igor Melnyk, Jarret Ross

ICML 2024

| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | "We use our framework to compare various large language models regarding risks related to drifting from instructions and outputting toxic content." (Section 5, Experiments) |
| Researcher Affiliation | Collaboration | "1IBM Research 2MIT-IBM Watson AI Lab. Correspondence to: Apoorva Nitsure <Apoorva.Nitsure@ibm.com>, Youssef Mroueh <mroueh@us.ibm.com>." |
| Pseudocode | Yes | "Algorithm 1 Stochastic Order Multi-testing (relative and absolute)" and "Algorithm 2 COMPUTEVIOLATIONRATIOS(Fa, Fb, order)" |
| Open Source Code | No | "Our code for relative and absolute testing performs all tests at once and relies on caching, vectorization, and multi-threading of the operations. Our code completes all tests in an average of just 17.7 s with 1000 bootstraps." |
| Open Datasets | Yes | "We use the data from (Jiang et al., 2023), that consists of an instruction, an input sentence and an expected output from the user, as well as the output of a set of different LLMs." and "We use the real toxicity prompts dataset of Gehman et al. (2020)." |
| Dataset Splits | No | "The dataset consists of a training set of 100K samples and a test set of 5K samples." |
| Hardware Specification | Yes | "Experiments were run on a CPU machine with 128 AMD cores, of which 2 were used." |
| Software Dependencies | No | "We use for rank aggregation the R package of (Pihur et al., 2009)." and "deepsig: Deepsignificance. https://github.com/Kaleidophon/deep-significance, 2022." |
| Experiment Setup | Yes | "We perform all our statistical tests with a significance level α = 0.05, and use 1000 bootstrap iterations." and "We sample from each model, 10 completions per prompt using nucleus sampling (top-p sampling with p = 0.9 and a temperature of 1)." |
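The Pseudocode row names Algorithm 2, COMPUTEVIOLATIONRATIOS(Fa, Fb, order), but the paper's code is not released. As a minimal sketch only, the function below computes a first-order stochastic dominance violation ratio between two empirical samples, in the del Barrio et al. style: the share of the squared quantile gap attributable to violations of "a dominates b". The function name, quantile-grid size, and the first-order-only scope are assumptions, not the authors' implementation.

```python
import numpy as np

def compute_violation_ratio(a, b, n_grid=1000):
    """Hypothetical sketch: first-order stochastic dominance violation ratio.

    Returns the fraction of the squared Wasserstein-2 distance between the
    empirical distributions of `a` and `b` that comes from regions where
    "a stochastically dominates b" is violated (i.e., F_b^{-1} > F_a^{-1}).
    A value near 0 supports dominance of a over b; near 1 suggests the
    reverse ordering.
    """
    # Midpoint quantile grid on (0, 1).
    t = (np.arange(n_grid) + 0.5) / n_grid
    qa = np.quantile(a, t)  # empirical quantile function of a
    qb = np.quantile(b, t)  # empirical quantile function of b
    # Mass of dominance violations: places where b's quantile exceeds a's.
    num = np.mean(np.clip(qb - qa, 0.0, None) ** 2)
    # Total squared quantile gap (empirical W2^2); epsilon avoids 0/0.
    den = np.mean((qa - qb) ** 2) + 1e-12
    return num / den
```

In the quoted setup this statistic would be recomputed on 1000 bootstrap resamples of `a` and `b`, and dominance accepted or rejected at significance level α = 0.05; that testing loop is omitted here.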
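The Experiment Setup row quotes nucleus (top-p) sampling with p = 0.9 and temperature 1. As a self-contained illustration of that decoding rule (not the authors' generation code, which presumably calls a model library), the sketch below samples a token index from a probability vector after keeping the smallest set of tokens whose cumulative probability reaches p; the helper name is hypothetical.

```python
import numpy as np

def nucleus_sample(probs, p=0.9, rng=None):
    """Sketch of top-p (nucleus) sampling over a token distribution.

    Keeps the smallest prefix of tokens, sorted by descending probability,
    whose cumulative mass is at least `p`, renormalizes, and samples from it.
    """
    rng = rng or np.random.default_rng()
    order = np.argsort(probs)[::-1]          # token ids, most probable first
    csum = np.cumsum(probs[order])           # cumulative mass along that order
    cutoff = np.searchsorted(csum, p) + 1    # smallest prefix covering mass p
    keep = order[:cutoff]                    # the "nucleus" of token ids
    kept = probs[keep] / probs[keep].sum()   # renormalize within the nucleus
    return rng.choice(keep, p=kept)
```

With `probs = [0.5, 0.4, 0.05, 0.05]` and p = 0.9, only the first two tokens survive the cutoff, so the two low-probability tail tokens are never sampled. Temperature 1, as quoted, leaves the model's probabilities unscaled before this step.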