Inherent Trade-Offs between Diversity and Stability in Multi-Task Benchmarks
Authors: Guanhua Zhang, Moritz Hardt
ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We empirically demonstrate a strong trade-off between diversity and sensitivity to irrelevant changes in existing multi-task benchmarks. Our result is based on new quantitative measures of diversity and sensitivity that we introduce. ... Through extensive experiments on seven cardinal benchmarks and eleven ordinal benchmarks, we demonstrate a clear trade-off between diversity and stability: The more diverse a multi-task benchmark, the more sensitive to trivial changes it is. |
| Researcher Affiliation | Academia | 1Max Planck Institute for Intelligent Systems, Tübingen and Tübingen AI Center. Correspondence to: Guanhua Zhang <guanhua.zhang@tuebingen.mpg.de>. |
| Pseudocode | Yes | Algorithm 1 Sensitivity for Cardinal Benchmarks ... Algorithm 2 Sensitivity for Ordinal Benchmarks (a generic sketch of what such a sensitivity measurement looks like follows the table) |
| Open Source Code | Yes | The codes and data are available at https://socialfoundations.github.io/benchbench/. |
| Open Datasets | Yes | For our experiments, we collected seven widely-used benchmarks: GLUE (Wang et al., 2018), SuperGLUE (Wang et al., 2019), MTEB (Muennighoff et al., 2022), BIG-Bench Hard (Suzgun et al., 2022), MMLU (Hendrycks et al., 2020), Open LLM (Beeching et al., 2023; Gao et al., 2021) and VTAB (Zhai et al., 2019). ... The ImageNet benchmark is based on the validation set of the ILSVRC2012 challenge (Deng et al., 2009). ... Our selected benchmarks for experiments consist of BigCode (Ben Allal et al., 2022), three benchmarks from HELM (Liang et al., 2023), and seven benchmarks from HEIM (Lee et al., 2023). |
| Dataset Splits | No | The paper describes using various established benchmarks (e.g., GLUE, ImageNet validation set) for evaluation, but it does not specify any train/validation/test splits that were applied to these benchmarks for the purpose of their own experiments or analysis. The models evaluated are pre-existing. |
| Hardware Specification | No | The paper does not explicitly describe the specific hardware (e.g., GPU models, CPU types, or cloud compute instances) used to run the experiments. |
| Software Dependencies | No | The paper mentions 'TorchVision' but does not specify its version, nor does it list other software dependencies with their respective version numbers (e.g., Python, PyTorch, or specific libraries). |
| Experiment Setup | Yes | For the sensitivity calculation in each benchmark, we set the minimal preserving portion ϵ = min{0.01, std_min/std_max}... λ is set to 0.0 and the number of gradient-descent steps T is 1000. ... λ is set to 0.01 and the number of gradient-descent steps T is 100. (A sketch of the ϵ computation follows the table.) |
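The minimal preserving portion quoted in the Experiment Setup row is a simple clipped ratio of per-task score spreads. Below is a minimal sketch, assuming a models × tasks score matrix and NumPy; the function name and toy data are illustrative and not taken from the authors' released code.

```python
import numpy as np

# Minimal sketch (not the authors' implementation): derive the minimal
# preserving portion eps = min{0.01, std_min / std_max} quoted in the
# "Experiment Setup" row. Function name and toy data are hypothetical.
def minimal_preserving_portion(scores: np.ndarray) -> float:
    """scores: (num_models, num_tasks) array of benchmark scores."""
    task_stds = scores.std(axis=0)  # per-task standard deviation across models
    return min(0.01, task_stds.min() / task_stds.max())

rng = np.random.default_rng(0)
toy_scores = rng.uniform(0.0, 1.0, size=(20, 7))  # 20 models, 7 tasks
print(minimal_preserving_portion(toy_scores))
```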
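The paper's Algorithms 1 and 2 are not reproduced here; the following is only a generic illustration of the underlying idea, namely measuring how much a mean-aggregated model ranking moves under a change that arguably should not matter. The choice of perturbation (a monotone rescaling of one task) and all names are assumptions, not the paper's procedure.

```python
import numpy as np
from scipy.stats import kendalltau

# Generic illustration (not Algorithm 1 or 2 from the paper): compare the
# model ranking before and after a "trivial" monotone rescaling of one task.
# Kendall's tau close to 1 means the ranking is stable; lower values mean
# the aggregate ranking is sensitive to the change.
rng = np.random.default_rng(0)
scores = rng.uniform(0.0, 1.0, size=(20, 7))   # 20 models, 7 tasks (toy data)

perturbed = scores.copy()
perturbed[:, 0] = perturbed[:, 0] ** 3         # monotone transform of task 0

tau, _ = kendalltau(scores.mean(axis=1), perturbed.mean(axis=1))
print(f"Kendall tau between original and perturbed rankings: {tau:.3f}")
```

A tau below 1 here simply indicates that the mean-based ranking changed under a perturbation that preserves every within-task ordering, which is the kind of instability the paper's sensitivity measures quantify.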