Risk Aware Benchmarking of Large Language Models
Authors: Apoorva Nitsure, Youssef Mroueh, Mattia Rigotti, Kristjan Greenewald, Brian Belgodere, Mikhail Yurochkin, Jiri Navratil, Igor Melnyk, Jarret Ross
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | "We use our framework to compare various large language models regarding risks related to drifting from instructions and outputting toxic content." and the section heading "5. Experiments" |
| Researcher Affiliation | Collaboration | "¹IBM Research, ²MIT-IBM Watson AI Lab. Correspondence to: Apoorva Nitsure <Apoorva.Nitsure@ibm.com>, Youssef Mroueh <mroueh@us.ibm.com>." |
| Pseudocode | Yes | "Algorithm 1: Stochastic Order Multi-testing (relative and absolute)" and "Algorithm 2: ComputeViolationRatios(F_a, F_b, order)" (a sketch of what such a routine might look like follows the table) |
| Open Source Code | No | The paper describes its implementation but provides no repository link: "Our code for relative and absolute testing performs all tests at once and relies on caching, vectorization, and multi-threading of the operations. Our code completes all tests in an average of just 17.7 s with 1000 bootstraps." |
| Open Datasets | Yes | "We use the data from (Jiang et al., 2023), that consists of an instruction, an input sentence and an expected output from the user, as well as the output of a set of different LLMs." and "We use the real toxicity prompts dataset of Gehman et al. (2020)." |
| Dataset Splits | No | "The dataset consists of a training set of 100K samples and a test set of 5K samples." |
| Hardware Specification | Yes | "Experiments were run on a CPU machine with 128 AMD cores, of which 2 were used." |
| Software Dependencies | No | "We use for rank aggregation the R package of (Pihur et al., 2009)." and the reference "deepsig: deep-significance. https://github.com/Kaleidophon/deep-significance, 2022." |
| Experiment Setup | Yes | "We perform all our statistical tests with a significance level α = 0.05, and use 1000 bootstrap iterations." and "We sample from each model 10 completions per prompt using nucleus sampling (top-p sampling with p = 0.9 and a temperature of 1)." (decoding and bootstrap sketches follow the table) |
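Since the paper's code is not released, the following is a minimal Python sketch of what Algorithm 2's ComputeViolationRatios might look like. It assumes the violation ratio is the positive-part integral of the difference between the two (integrated) empirical CDFs, normalized by the total absolute difference; the function names, the trapezoid-rule integration, and this exact normalization are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def empirical_cdf(samples, grid):
    """Empirical CDF of `samples` evaluated at each point of `grid`."""
    samples = np.sort(samples)
    return np.searchsorted(samples, grid, side="right") / len(samples)

def compute_violation_ratio(a, b, order=2):
    """Sketch of a violation-ratio statistic for (relaxed) stochastic
    dominance of the model behind sample `a` over the model behind `b`.

    order=1 compares CDFs directly; order=2 compares integrated CDFs.
    The ratio is the mass of dominance violations divided by the total
    discrepancy: 0 means clean dominance, values near 0.5 mean neither
    model dominates.
    """
    grid = np.sort(np.concatenate([a, b]))
    dt = np.diff(grid)
    Fa, Fb = empirical_cdf(a, grid), empirical_cdf(b, grid)
    if order == 2:
        # integrated CDFs F2(t) = integral of F(u) du up to t (trapezoid rule)
        Fa = np.concatenate([[0.0], np.cumsum(0.5 * (Fa[1:] + Fa[:-1]) * dt)])
        Fb = np.concatenate([[0.0], np.cumsum(0.5 * (Fb[1:] + Fb[:-1]) * dt)])
    diff = Fa - Fb  # positive where `a` violates dominance over `b`
    violations = np.trapz(np.clip(diff, 0.0, None), grid)
    total = np.trapz(np.abs(diff), grid)
    # identical distributions yield zero discrepancy; return the
    # "no winner" value 0.5 by convention in that degenerate case
    return violations / total if total > 0 else 0.5
```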
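Algorithm 1 then turns such ratios into accept/reject decisions via bootstrapping. Below is a hedged sketch of a single relative test at the paper's stated settings (α = 0.05, 1000 bootstrap iterations), reusing `compute_violation_ratio` from the sketch above. The rejection rule, comparing an upper bootstrap quantile of the ratio against a tolerance τ, and the value τ = 0.25 are illustrative assumptions, not the paper's exact decision rule.

```python
def bootstrap_dominance_test(a, b, n_boot=1000, alpha=0.05, tau=0.25,
                             order=2, seed=0):
    """Declare that the model behind `a` dominates the model behind `b`
    if the (1 - alpha) bootstrap quantile of the violation ratio stays
    below the tolerance `tau` (a hypothetical knob, not from the paper)."""
    rng = np.random.default_rng(seed)
    stats = np.empty(n_boot)
    for i in range(n_boot):
        # resample each model's scores with replacement
        ra = rng.choice(a, size=len(a), replace=True)
        rb = rng.choice(b, size=len(b), replace=True)
        stats[i] = compute_violation_ratio(ra, rb, order=order)
    return np.quantile(stats, 1.0 - alpha) < tau
```

Caching the sorted sample arrays and vectorizing or multi-threading across bootstrap replicates and model pairs, as the paper describes, is what makes all tests complete in roughly 17.7 s.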
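Finally, the decoding configuration in the Experiment Setup row maps directly onto standard Hugging Face `transformers` generation arguments. A minimal sketch; `gpt2` is a stand-in checkpoint, not one of the LLMs benchmarked in the paper:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # placeholder model
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("The quick brown fox", return_tensors="pt")
# 10 completions per prompt with nucleus sampling (top-p = 0.9, temperature = 1)
outputs = model.generate(
    **inputs,
    do_sample=True,
    top_p=0.9,
    temperature=1.0,
    num_return_sequences=10,
    max_new_tokens=20,
    pad_token_id=tokenizer.eos_token_id,
)
completions = tokenizer.batch_decode(outputs, skip_special_tokens=True)
```

Each sampled completion would then be scored (e.g., for toxicity) to form the per-model score distributions that the stochastic-order tests above compare.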