tinyBenchmarks: evaluating LLMs with fewer examples
Authors: Felipe Maia Polo, Lucas Weber, Leshem Choshen, Yuekai Sun, Gongjun Xu, Mikhail Yurochkin
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our empirical analysis demonstrates that these tools and tiny benchmarks are sufficient to reliably and efficiently reproduce the original evaluation results. |
| Researcher Affiliation | Collaboration | Felipe Maia Polo (1), Lucas Weber (2), Leshem Choshen (3, 4), Yuekai Sun (1), Gongjun Xu (1), Mikhail Yurochkin (3, 5). (1) Department of Statistics, University of Michigan, USA; (2) Department of Translation and Language Sciences, Universitat Pompeu Fabra, Spain; (3) IBM Research; (4) MIT; (5) MIT-IBM Watson AI Lab. |
| Pseudocode | No | The paper describes its methods using natural language and mathematical formulas but does not include any pseudocode or algorithm blocks. |
| Open Source Code | Yes | To use our methods for efficient LLM evaluation, please check https://github.com/felipemaiapolo/tinyBenchmarks. This repository includes a Python package for model evaluation and tutorials. |
| Open Datasets | Yes | We release evaluation tools and tiny versions of popular benchmarks: Open LLM Leaderboard, MMLU, HELM, and Alpaca Eval 2.0. Our empirical analysis demonstrates that these tools and tiny benchmarks are sufficient to reliably and efficiently reproduce the original evaluation results. |
| Dataset Splits | Yes | To select the dimension of the IRT model during the fitting procedure, we run a simple validation strategy in the training set and choose the dimension that maximizes the prediction power of the IRT model in the validation split; we consider the dimensions in {2, 5, 10, 15}. |
| Hardware Specification | No | The paper mentions '4K GPU hours' and that their tool 'can be run on a CPU in a few seconds' but does not specify any particular GPU or CPU models, or detailed hardware configurations used for the experiments. |
| Software Dependencies | No | The paper mentions a 'Python package for model evaluation' but does not specify the version of Python or any other software libraries with their version numbers. |
| Experiment Setup | Yes | To select the dimension of the IRT model during the fitting procedure, we run a simple validation strategy in the training set and choose the dimension that maximizes the prediction power of the IRT model in the validation split; we consider the dimensions in {2, 5, 10, 15}. |
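The dimension-selection strategy quoted above (fit an IRT model on a training split, then pick the latent dimension among {2, 5, 10, 15} that best predicts a held-out validation split) can be sketched as follows. This is an illustrative toy implementation on synthetic data, not the authors' code: the gradient-ascent fitting routine, the 80/20 cell split, and all function names here are assumptions made for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_irt(Y, train_mask, dim, steps=300, lr=0.05):
    """Fit a simple multidimensional 2PL-style IRT model by gradient ascent.
    Y: binary correctness matrix (models x items); train_mask: 1 = training cell."""
    n_models, n_items = Y.shape
    theta = rng.normal(0, 0.1, (n_models, dim))  # latent model abilities
    a = rng.normal(0, 0.1, (n_items, dim))       # item discriminations
    b = np.zeros(n_items)                        # item difficulties
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(theta @ a.T - b)))
        g = (Y - p) * train_mask                 # log-likelihood gradient signal
        theta += lr * (g @ a) / n_items
        a += lr * (g.T @ theta) / n_models
        b -= lr * g.mean(axis=0)
    return theta, a, b

def val_accuracy(Y, val_mask, theta, a, b):
    """Fraction of held-out cells the fitted model predicts correctly."""
    p = 1.0 / (1.0 + np.exp(-(theta @ a.T - b)))
    return ((p > 0.5).astype(float) == Y)[val_mask.astype(bool)].mean()

# Synthetic correctness data drawn from a 5-dimensional ground-truth IRT model.
n_models, n_items, true_dim = 60, 120, 5
theta_true = rng.normal(size=(n_models, true_dim))
a_true = rng.normal(size=(n_items, true_dim))
Y = (rng.random((n_models, n_items)) <
     1.0 / (1.0 + np.exp(-(theta_true @ a_true.T)))).astype(float)

# Random 80/20 train/validation split over matrix cells.
train_mask = (rng.random(Y.shape) < 0.8).astype(float)
val_mask = 1.0 - train_mask

# Evaluate the candidate dimensions from the paper and keep the best.
scores = {}
for dim in [2, 5, 10, 15]:
    theta, a, b = fit_irt(Y, train_mask, dim)
    scores[dim] = val_accuracy(Y, val_mask, theta, a, b)

best_dim = max(scores, key=scores.get)
print(best_dim, scores)
```

The key design point mirrored from the quote is that model fitting only ever sees the training cells (via `train_mask`), so validation accuracy is an honest estimate of each dimension's predictive power.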