Multivariate Stochastic Dominance via Optimal Transport and Applications to Models Benchmarking
Authors: Gabriel Rioux, Apoorva Nitsure, Mattia Rigotti, Kristjan Greenewald, Youssef Mroueh
NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We showcase our method in comparing and benchmarking Large Language Models that are evaluated on multiple metrics. |
| Researcher Affiliation | Collaboration | Gabriel Rioux (Center for Applied Mathematics, Cornell University); Apoorva Nitsure (MIT-IBM Watson AI Lab, IBM Research); Mattia Rigotti (MIT-IBM Watson AI Lab, IBM Research); Kristjan Greenewald (MIT-IBM Watson AI Lab, IBM Research); Youssef Mroueh (MIT-IBM Watson AI Lab, IBM Research) |
| Pseudocode | Yes | Appendix A (Algorithm): Algorithm 1, Multivariate Stochastic Order Multi-testing (relative and absolute) |
| Open Source Code | Yes | Code for these experiments is available at https://github.com/IBM/stochastic-order-eval. |
| Open Datasets | Yes | For our first evaluation we use the dataset from Jiang et al. [2023] (MIT license) that consists of responses from 12 different instruction following LLMs |
| Dataset Splits | No | The data has a train (100K rows) and test (5k rows) split where each row consists of an instruction, input sentence, the expected output from users, as well as the responses of a set of different LLMs with their decoding parameters and evaluation scores on different metrics. |
| Hardware Specification | Yes | All experiments were run on NVIDIA A100 80GB GPUs |
| Software Dependencies | Yes | All experiments were run on NVIDIA A100 80GB GPUs using PyTorch [Ansel et al., 2024] (v.2.3.0, BSD-3 license) and the Python Optimal Transport package [Flamary et al., 2021] (v.0.9.3, MIT license) |
| Experiment Setup | Yes | We then compute the pairwise ratios for these empirical distributions using the logistic loss with β = 0.2, the regularization parameter λ = 0.1, and utilize the relative testing procedure from Section 4.2 to rank the 12 LLMs *(a sketch of this computation follows the table)* |
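The "Experiment Setup" row quotes the paper's configuration: a logistic loss with β = 0.2, entropic regularization λ = 0.1, and the relative testing procedure of Section 4.2 used to rank the 12 LLMs. Below is a minimal sketch of how such a pairwise violation-ratio computation could look with the Python Optimal Transport package. The helper names, the softplus form taken for the "logistic loss" smoothing, and the exact ratio definition are assumptions made for illustration, not the authors' reference implementation (which is at https://github.com/IBM/stochastic-order-eval).

```python
# Hedged sketch: pairwise dominance-violation ratios via entropic OT.
import numpy as np
import ot  # Python Optimal Transport (POT), v0.9.x


def smoothed_violation_cost(scores_a, scores_b, beta=0.2):
    # diff[i, j, k] > 0 when sample j of model B beats sample i of
    # model A on metric k (higher scores assumed better).
    diff = scores_b[None, :, :] - scores_a[:, None, :]
    # Softplus surrogate beta * log(1 + exp(z / beta)) for max(z, 0),
    # summed over the d metrics -- an assumed form of the logistic loss.
    return (beta * np.logaddexp(0.0, diff / beta)).sum(axis=-1)


def violation_stat(scores_a, scores_b, beta=0.2, reg=0.1):
    # Entropic OT value (Sinkhorn) of the smoothed violation cost
    # between the two empirical score distributions.
    C = smoothed_violation_cost(scores_a, scores_b, beta)
    return ot.sinkhorn2(ot.unif(len(scores_a)), ot.unif(len(scores_b)), C, reg)


def pairwise_ratio(scores_a, scores_b, beta=0.2, reg=0.1):
    # Relative ratio in [0, 1]; values near 0 suggest A dominates B.
    ab = violation_stat(scores_a, scores_b, beta, reg)
    ba = violation_stat(scores_b, scores_a, beta, reg)
    return ab / (ab + ba)
```

Here `scores_a` and `scores_b` would be (n, d) arrays of per-response evaluation metrics, e.g. one row per instruction-following response and one column per metric, as in the Jiang et al. [2023] data.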
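Building on that helper, the relative-testing step named in the Pseudocode row (Algorithm 1, Multivariate Stochastic Order Multi-testing) could be read, very roughly, as ranking models by their mean pairwise violation ratio. The sketch below assumes the `pairwise_ratio` helper above and omits the statistical machinery (bootstrap confidence bounds, multiple-testing corrections) that the actual procedure applies; `rank_models` is a hypothetical name.

```python
# Hedged sketch of a relative-testing ranking over many models.
import itertools

import numpy as np


def rank_models(scores_by_model, beta=0.2, reg=0.1):
    # scores_by_model: dict mapping a model name to its (n, d) array of
    # per-response metric scores. Returns names sorted best-first.
    names = list(scores_by_model)
    k = len(names)
    R = np.full((k, k), 0.5)  # 0.5 on the diagonal = no self-preference
    for i, j in itertools.combinations(range(k), 2):
        r = pairwise_ratio(scores_by_model[names[i]],
                           scores_by_model[names[j]], beta, reg)
        R[i, j], R[j, i] = r, 1.0 - r
    # A lower mean violation ratio means the model is dominated less
    # often, so sort ascending.
    order = np.argsort(R.mean(axis=1))
    return [names[idx] for idx in order]
```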