Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.
tinyBenchmarks: evaluating LLMs with fewer examples
Authors: Felipe Maia Polo, Lucas Weber, Leshem Choshen, Yuekai Sun, Gongjun Xu, Mikhail Yurochkin
ICML 2024 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our empirical analysis demonstrates that these tools and tiny benchmarks are sufficient to reliably and efficiently reproduce the original evaluation results. |
| Researcher Affiliation | Collaboration | Felipe Maia Polo (1), Lucas Weber (2), Leshem Choshen (3, 4), Yuekai Sun (1), Gongjun Xu (1), Mikhail Yurochkin (3, 5). (1) Department of Statistics, University of Michigan, USA; (2) Department of Translation and Language Sciences, University of Pompeu Fabra, Spain; (3) IBM Research; (4) MIT; (5) MIT-IBM Watson AI Lab. |
| Pseudocode | No | The paper describes its methods using natural language and mathematical formulas but does not include any pseudocode or algorithm blocks. |
| Open Source Code | Yes | To use our methods for efficient LLM evaluation, please check https://github.com/felipemaiapolo/tinyBenchmarks. This repository includes a Python package for model evaluation and tutorials. |
| Open Datasets | Yes | We release evaluation tools and tiny versions of popular benchmarks: Open LLM Leaderboard, MMLU, HELM, and AlpacaEval 2.0. Our empirical analysis demonstrates that these tools and tiny benchmarks are sufficient to reliably and efficiently reproduce the original evaluation results. |
| Dataset Splits | Yes | To select the dimension of the IRT model during the fitting procedure, we run a simple validation strategy in the training set and choose the dimension that maximizes the prediction power of the IRT model in the validation split; we consider the dimensions in {2, 5, 10, 15}. |
| Hardware Specification | No | The paper mentions '4K GPU hours' and that their tool 'can be run on a CPU in a few seconds' but does not specify any particular GPU or CPU models, or detailed hardware configurations used for the experiments. |
| Software Dependencies | No | The paper mentions a 'Python package for model evaluation' but does not specify the version of Python or any other software libraries with their version numbers. |
| Experiment Setup | Yes | To select the dimension of the IRT model during the fitting procedure, we run a simple validation strategy in the training set and choose the dimension that maximizes the prediction power of the IRT model in the validation split; we consider the dimensions in {2, 5, 10, 15}. |
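
The validation strategy quoted in the Dataset Splits and Experiment Setup rows can be illustrated with a minimal sketch. It is not the authors' implementation: a truncated SVD stands in for the IRT model, and the function name `select_dimension`, the split fractions, and the synthetic data are all assumptions made for illustration. Only the candidate dimensions {2, 5, 10, 15} and the idea of choosing the dimension with the best predictions on a validation split come from the quoted text.

```python
import numpy as np


def select_dimension(Y, dims=(2, 5, 10, 15), val_frac=0.2, seen_frac=0.5, seed=0):
    """Pick the latent dimension with the lowest held-out prediction error.

    Y is a (models x items) 0/1 correctness matrix. A truncated SVD is used
    as a stand-in for the paper's IRT model (an assumption of this sketch).
    """
    rng = np.random.default_rng(seed)
    n_models, n_items = Y.shape

    # Hold out some training-set models as the validation split.
    perm = rng.permutation(n_models)
    n_val = max(1, int(val_frac * n_models))
    val_rows, fit_rows = perm[:n_val], perm[n_val:]

    # For validation models, only a subset of items is treated as observed.
    item_perm = rng.permutation(n_items)
    n_seen = int(seen_frac * n_items)
    seen, unseen = item_perm[:n_seen], item_perm[n_seen:]

    item_mean = Y[fit_rows].mean(axis=0)
    # Rows of Vt are item-side latent directions, ordered by singular value.
    _, _, Vt = np.linalg.svd(Y[fit_rows] - item_mean, full_matrices=False)

    errors = {}
    for d in dims:
        V = Vt[:d].T  # (items x d) item factors
        # Estimate each validation model's embedding from the seen items only.
        resid = (Y[val_rows][:, seen] - item_mean[seen]).T
        theta, *_ = np.linalg.lstsq(V[seen], resid, rcond=None)
        # Predict the unseen items and score by mean absolute error.
        pred = np.clip(item_mean[unseen] + (V[unseen] @ theta).T, 0.0, 1.0)
        errors[d] = float(np.abs(pred - Y[val_rows][:, unseen]).mean())

    best = min(errors, key=errors.get)
    return best, errors


# Synthetic demo data (hypothetical): 60 models, 200 items, 3 true latent factors.
demo_rng = np.random.default_rng(1)
logits = demo_rng.normal(size=(60, 3)) @ demo_rng.normal(size=(3, 200))
Y_demo = (demo_rng.random((60, 200)) < 1.0 / (1.0 + np.exp(-logits))).astype(float)

best_dim, val_errors = select_dimension(Y_demo)
print(best_dim, val_errors)
```

The sketch minimizes mean absolute error on the held-out items as one possible reading of "prediction power"; the paper's actual criterion and IRT fitting procedure may differ.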