reproducibilityindex.ai

tinyBenchmarks: evaluating LLMs with fewer examples

Authors: Felipe Maia Polo, Lucas Weber, Leshem Choshen, Yuekai Sun, Gongjun Xu, Mikhail Yurochkin

ICML 2024

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Our empirical analysis demonstrates that these tools and tiny benchmarks are sufficient to reliably and efficiently reproduce the original evaluation results.
Researcher Affiliation	Collaboration	Felipe Maia Polo (1), Lucas Weber (2), Leshem Choshen (3, 4), Yuekai Sun (1), Gongjun Xu (1), Mikhail Yurochkin (3, 5). (1) Department of Statistics, University of Michigan, USA; (2) Department of Translation and Language Sciences, University of Pompeu Fabra, Spain; (3) IBM Research; (4) MIT; (5) MIT-IBM Watson AI Lab.
Pseudocode	No	The paper describes its methods using natural language and mathematical formulas but does not include any pseudocode or algorithm blocks.
Open Source Code	Yes	To use our methods for efficient LLM evaluation, please check https://github.com/felipemaiapolo/tinyBenchmarks. This repository includes a Python package for model evaluation and tutorials.
Open Datasets	Yes	We release evaluation tools and tiny versions of popular benchmarks: Open LLM Leaderboard, MMLU, HELM, and AlpacaEval 2.0. Our empirical analysis demonstrates that these tools and tiny benchmarks are sufficient to reliably and efficiently reproduce the original evaluation results.
Dataset Splits	Yes	To select the dimension of the IRT model during the fitting procedure, we run a simple validation strategy in the training set and choose the dimension that maximizes the prediction power of the IRT model in the validation split; we consider the dimensions in {2, 5, 10, 15}.
Hardware Specification	No	The paper mentions '4K GPU hours' and that their tool 'can be run on a CPU in a few seconds' but does not specify any particular GPU or CPU models, or detailed hardware configurations used for the experiments.
Software Dependencies	No	The paper mentions a 'Python package for model evaluation' but does not specify the version of Python or any other software libraries with their version numbers.
Experiment Setup	Yes	To select the dimension of the IRT model during the fitting procedure, we run a simple validation strategy in the training set and choose the dimension that maximizes the prediction power of the IRT model in the validation split; we consider the dimensions in {2, 5, 10, 15}.
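
The dimension-selection step quoted under Dataset Splits and Experiment Setup can be illustrated with a minimal sketch. It is not the authors' code. It assumes a binary correctness matrix (models by items), uses a logistic factor model fit by plain gradient descent as a simplified stand-in for the paper's IRT model, and holds out random matrix entries as the validation split. The paper's actual split and fitting procedure may differ. All function names and hyperparameters here are hypothetical; only the candidate set {2, 5, 10, 15} comes from the quoted text.

```python
import numpy as np

CANDIDATE_DIMS = (2, 5, 10, 15)  # candidate dimensions named in the quoted passage


def sigmoid(z):
    # Clip logits so np.exp never overflows.
    return 1.0 / (1.0 + np.exp(-np.clip(z, -30.0, 30.0)))


def fit_logistic_factor_model(Y, mask, dim, steps=500, lr=0.5, reg=1e-3, seed=0):
    """Fit P(Y_ij = 1) = sigmoid(theta_i . alpha_j - beta_j) on entries where mask is True.

    Simplified stand-in for a multidimensional IRT fit (an assumption of this sketch).
    """
    rng = np.random.default_rng(seed)
    n_models, n_items = Y.shape
    theta = 0.1 * rng.standard_normal((n_models, dim))  # per-model ability vectors
    alpha = 0.1 * rng.standard_normal((n_items, dim))   # per-item discrimination vectors
    beta = np.zeros(n_items)                            # per-item difficulty
    for _ in range(steps):
        P = sigmoid(theta @ alpha.T - beta)
        G = (P - Y) * mask  # log-loss gradient w.r.t. logits, training entries only
        # Gradients are averaged over items/models to keep the step size stable.
        grad_theta = G @ alpha / n_items + reg * theta
        grad_alpha = G.T @ theta / n_models + reg * alpha
        grad_beta = -G.sum(axis=0) / n_models
        theta -= lr * grad_theta
        alpha -= lr * grad_alpha
        beta -= lr * grad_beta
    return theta, alpha, beta


def validation_log_loss(Y, val_mask, theta, alpha, beta):
    """Mean log loss on the held-out entries (lower means better prediction)."""
    P = np.clip(sigmoid(theta @ alpha.T - beta), 1e-6, 1 - 1e-6)
    losses = -(Y * np.log(P) + (1 - Y) * np.log(1 - P))
    return float(losses[val_mask].mean())


def select_dimension(Y, candidate_dims=CANDIDATE_DIMS, val_fraction=0.2, seed=0):
    """Return the candidate dimension with the lowest validation loss, plus all scores."""
    rng = np.random.default_rng(seed)
    val_mask = rng.random(Y.shape) < val_fraction  # entry-wise holdout (assumption)
    train_mask = ~val_mask
    scores = {}
    for dim in candidate_dims:
        theta, alpha, beta = fit_logistic_factor_model(Y, train_mask, dim, seed=seed)
        scores[dim] = validation_log_loss(Y, val_mask, theta, alpha, beta)
    best_dim = min(scores, key=scores.get)
    return best_dim, scores
```

Calling `select_dimension(Y)` on a float array of 0/1 responses returns the selected dimension and a dictionary of validation losses per candidate. Minimizing held-out log loss is one way to read "maximizes the prediction power"; other validation metrics would fit the same loop.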