reproducibilityindex.ai

Risk Aware Benchmarking of Large Language Models

Authors: Apoorva Nitsure, Youssef Mroueh, Mattia Rigotti, Kristjan Greenewald, Brian Belgodere, Mikhail Yurochkin, Jiri Navratil, Igor Melnyk, Jarret Ross

ICML 2024

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	"We use our framework to compare various large language models regarding risks related to drifting from instructions and outputting toxic content." and the section heading "5. Experiments"
Researcher Affiliation	Collaboration	¹IBM Research, ²MIT-IBM Watson AI Lab. Correspondence to: Apoorva Nitsure <Apoorva.Nitsure@ibm.com>, Youssef Mroueh <mroueh@us.ibm.com>.
Pseudocode	Yes	"Algorithm 1 Stochastic Order Multi-testing (relative and absolute)" and "Algorithm 2 COMPUTEVIOLATIONRATIOS(F_a, F_b, order)"
Open Source Code	No	Our code for relative and absolute testing performs all tests at once and relies on caching, vectorization, and multi-threading of the operations. Our code completes all tests in an average of just 17.7 s with 1000 bootstraps. Experiments were run on a CPU machine with 128 AMD cores, of which 2 were used.
Open Datasets	Yes	"We use the data from (Jiang et al., 2023), that consists of an instruction, an input sentence and an expected output from the user, as well as the output of a set of different LLMs." and "We use the real toxicity prompts dataset of Gehman et al. (2020)"
Dataset Splits	No	The dataset consists of a training set of 100K samples and a test set of 5K samples.
Hardware Specification	Yes	Experiments were run on a CPU machine with 128 AMD cores, of which 2 were used.
Software Dependencies	No	"We use for rank aggregation the R package of (Pihur et al., 2009)." and the reference entry "deepsig. Deepsignificance. https://github.com/Kaleidophon/deep-significance, 2022."
Experiment Setup	Yes	"We perform all our statistical tests with a significance level α = 0.05, and use 1000 bootstrap iterations." and "We sample from each model, 10 completions per prompt using nucleus sampling (top-p sampling with p = 0.9 and a temperature of 1)."
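
The Experiment Setup row mentions nucleus (top-p) sampling with p = 0.9 and a temperature of 1. As a rough illustration of what that decoding setting means, here is a minimal, standard-library sketch of top-p sampling over a single vector of next-token logits. It is not the authors' code: the function names `top_p_filter` and `sample_top_p` and the toy logits are hypothetical, and a real setup would apply this step inside a model's generation loop.

```python
import math
import random


def top_p_filter(logits, p=0.9, temperature=1.0):
    """Return {token_index: renormalized probability} for the top-p nucleus.

    Hypothetical helper for illustration; not taken from the paper.
    """
    # Temperature-scaled softmax (subtract the max for numerical stability).
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep the smallest set of most probable tokens whose mass reaches p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    nucleus, cumulative = [], 0.0
    for i in order:
        nucleus.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break

    # Renormalize the kept probabilities so they sum to 1.
    mass = sum(probs[i] for i in nucleus)
    return {i: probs[i] / mass for i in nucleus}


def sample_top_p(logits, p=0.9, temperature=1.0, rng=random):
    """Draw one token index from the top-p nucleus."""
    dist = top_p_filter(logits, p, temperature)
    tokens = list(dist)
    weights = [dist[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]


# Toy example: four tokens with probabilities 0.5, 0.3, 0.15, 0.05.
toy_logits = [math.log(q) for q in (0.5, 0.3, 0.15, 0.05)]
rng = random.Random(0)
# Ten draws, mirroring "10 completions per prompt" at the single-token level.
draws = [sample_top_p(toy_logits, p=0.9, temperature=1.0, rng=rng) for _ in range(10)]
```

With these toy logits the nucleus keeps the three most probable tokens (cumulative mass 0.95) and drops the last one, so index 3 can never be drawn.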