Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Risk Aware Benchmarking of Large Language Models
Authors: Apoorva Nitsure, Youssef Mroueh, Mattia Rigotti, Kristjan Greenewald, Brian Belgodere, Mikhail Yurochkin, Jiri Navratil, Igor Melnyk, Jarret Ross
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We use our framework to compare various large language models regarding risks related to drifting from instructions and outputting toxic content. and 5. Experiments |
| Researcher Affiliation | Collaboration | 1IBM Research 2MIT-IBM Watson AI Lab. Correspondence to: Apoorva Nitsure <EMAIL>, Youssef Mroueh <EMAIL>. |
| Pseudocode | Yes | Algorithm 1 Stochastic Order Multi-testing (relative and absolute) and Algorithm 2 COMPUTEVIOLATIONRATIOS(Fa,Fb,order) |
| Open Source Code | No | Our code for relative and absolute testing performs all tests at once and relies on caching vectorization and multi-threading of the operations. Our code completes all tests in an average of just 17.7 s with 1000 bootstraps. Experiments were run on a CPU machine with 128 AMD cores, of which 2 were used. |
| Open Datasets | Yes | We use the data from (Jiang et al., 2023), that consists of an instruction, an input sentence and an expected output from the user, as well as the output of a set of different LLMs. and We use the real toxicity prompts dataset of Gehman et al. (2020) |
| Dataset Splits | No | The dataset consists of a training set of 100K samples and a test set of 5K samples. |
| Hardware Specification | Yes | Experiments were run on a CPU machine with 128 AMD cores, of which 2 were used. |
| Software Dependencies | No | We use for rank aggregation the R package of (Pihur et al., 2009). and deepsig. Deepsignificance. https://github.com/Kaleidophon/deep-significance, 2022. |
| Experiment Setup | Yes | We perform all our statistical tests with a significance level α = 0.05, and use 1000 bootstrap iterations. and We sample from each model, 10 completions per prompt using nucleus sampling (top-p sampling with p = 0.9 and a temperature of 1). |
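
The Experiment Setup row quotes nucleus sampling (top-p with p = 0.9, temperature 1) and 10 completions per prompt. As a rough illustration of what that setting means, the sketch below draws one token index from a vector of logits. It is a generic, standard-library version of top-p sampling, not the authors' code; the function names `nucleus_indices` and `top_p_sample` and the toy logits are assumptions made for this example.

```python
import math
import random


def nucleus_indices(probs, p=0.9):
    """Return the smallest set of indices (most probable first) whose
    cumulative probability reaches at least p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    return kept


def top_p_sample(logits, p=0.9, temperature=1.0, rng=None):
    """Sample one index from the logits using temperature scaling followed
    by top-p (nucleus) truncation."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    rng = rng or random.Random()
    # Temperature-scaled softmax, shifted by the max for numerical stability.
    scaled = [x / temperature for x in logits]
    shift = max(scaled)
    exps = [math.exp(x - shift) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Keep only the nucleus; random.choices renormalizes the weights.
    kept = nucleus_indices(probs, p)
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]


# Toy vocabulary of four tokens with probabilities 0.5, 0.3, 0.15, 0.05.
toy_logits = [math.log(0.5), math.log(0.3), math.log(0.15), math.log(0.05)]
rng = random.Random(0)
# Ten draws, mirroring the "10 completions per prompt" setting (one token each here).
draws = [top_p_sample(toy_logits, p=0.9, temperature=1.0, rng=rng) for _ in range(10)]
```

With the toy distribution, the first three tokens already cover 0.95 of the probability mass, so the least likely token falls outside the nucleus and is never drawn. A real language model would apply this step at every decoding position over its full vocabulary.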