tinyBenchmarks: evaluating LLMs with fewer examples

Authors: Felipe Maia Polo, Lucas Weber, Leshem Choshen, Yuekai Sun, Gongjun Xu, Mikhail Yurochkin

ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | 'Our empirical analysis demonstrates that these tools and tiny benchmarks are sufficient to reliably and efficiently reproduce the original evaluation results.'
Researcher Affiliation | Collaboration | Felipe Maia Polo (1), Lucas Weber (2), Leshem Choshen (3,4), Yuekai Sun (1), Gongjun Xu (1), Mikhail Yurochkin (3,5). Affiliations: (1) Department of Statistics, University of Michigan, USA; (2) Department of Translation and Language Sciences, Universitat Pompeu Fabra, Spain; (3) IBM Research; (4) MIT; (5) MIT-IBM Watson AI Lab.
Pseudocode | No | The paper describes its methods in natural language and mathematical formulas but does not include any pseudocode or algorithm blocks.
Open Source Code | Yes | 'To use our methods for efficient LLM evaluation, please check https://github.com/felipemaiapolo/tinyBenchmarks. This repository includes a Python package for model evaluation and tutorials.' (A usage sketch follows the table.)
Open Datasets | Yes | 'We release evaluation tools and tiny versions of popular benchmarks: Open LLM Leaderboard, MMLU, HELM, and AlpacaEval 2.0. Our empirical analysis demonstrates that these tools and tiny benchmarks are sufficient to reliably and efficiently reproduce the original evaluation results.'
Dataset Splits | Yes | 'To select the dimension of the IRT model during the fitting procedure, we run a simple validation strategy in the training set and choose the dimension that maximizes the prediction power of the IRT model in the validation split; we consider the dimensions in {2, 5, 10, 15}.'
Hardware Specification | No | The paper mentions '4K GPU hours' and that the tool 'can be run on a CPU in a few seconds' but does not specify particular GPU or CPU models or any detailed hardware configuration used for the experiments.
Software Dependencies | No | The paper mentions a 'Python package for model evaluation' but does not specify a Python version or version numbers for any other software libraries.
Experiment Setup | Yes | The same validation procedure serves as the experiment setup: the IRT model dimension is chosen from {2, 5, 10, 15} by maximizing predictive power on a validation split (see the selection-loop sketch after the table).
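The released Python package is the intended entry point for evaluating a model on a tiny benchmark. Below is a minimal usage sketch, assuming the package is importable as `tinyBenchmarks` and exposes an `evaluate(y, benchmark)` function as shown in the repository's tutorials; the exact function name, input length, and supported benchmark identifiers should be verified against the repository.

```python
import numpy as np
import tinyBenchmarks as tb  # package from github.com/felipemaiapolo/tinyBenchmarks

# Correctness vector of one LLM on the curated tiny examples
# (1.0 = correct, 0.0 = incorrect); random placeholder values here.
y = (np.random.default_rng(0).random(100) < 0.6).astype(float)

# Assumed entry point per the repository's tutorial: estimate the model's
# full-benchmark performance from its answers on the tiny version.
results = tb.evaluate(y, 'mmlu')
print(results)
```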
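To make the dimension-selection step concrete, the following is a self-contained sketch of the validation strategy quoted above: for each candidate dimension in {2, 5, 10, 15}, fit a multidimensional 2PL-style IRT model on a training split of model-item correctness data and keep the dimension with the best predictive accuracy on the held-out split. The gradient-descent fit, the synthetic data, and all names here are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_irt(Y, mask, dim, steps=2000, lr=0.5):
    """Fit a multidimensional 2PL-style IRT model,
    P(Y_ij = 1) = sigmoid(theta_i . alpha_j - beta_j),
    by gradient descent on the entries where mask is True."""
    n_models, n_items = Y.shape
    theta = 0.1 * rng.standard_normal((n_models, dim))  # model abilities
    alpha = 0.1 * rng.standard_normal((n_items, dim))   # item discriminations
    beta = np.zeros(n_items)                            # item difficulties
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(theta @ alpha.T - beta)))
        g = (p - Y) * mask  # gradient of BCE w.r.t. logits, zeroed outside train
        theta -= lr * (g @ alpha) / mask.sum()
        alpha -= lr * (g.T @ theta) / mask.sum()
        beta += lr * g.sum(axis=0) / mask.sum()
    return theta, alpha, beta

def heldout_accuracy(Y, mask, theta, alpha, beta):
    """Accuracy of the fitted model at predicting masked correctness entries."""
    p = 1.0 / (1.0 + np.exp(-(theta @ alpha.T - beta)))
    return (((p > 0.5) == Y) * mask).sum() / mask.sum()

# Synthetic correctness matrix (80 LLMs x 300 items), for illustration only.
Y = (rng.random((80, 300)) < 0.5).astype(float)
train = rng.random(Y.shape) < 0.8  # random 80/20 entry-level split
val = ~train

best_dim, best_acc = None, -1.0
for dim in [2, 5, 10, 15]:  # candidate dimensions from the paper
    params = fit_irt(Y, train, dim)
    acc = heldout_accuracy(Y, val, *params)
    if acc > best_acc:
        best_dim, best_acc = dim, acc
print(f"selected IRT dimension: {best_dim} (validation accuracy {best_acc:.3f})")
```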