Inherent Trade-Offs between Diversity and Stability in Multi-Task Benchmarks
Authors: Guanhua Zhang, Moritz Hardt
ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We empirically demonstrate a strong trade-off between diversity and sensitivity to irrelevant changes in existing multi-task benchmarks. Our result is based on new quantitative measures of diversity and sensitivity that we introduce. ... Through extensive experiments on seven cardinal benchmarks and eleven ordinal benchmarks, we demonstrate a clear trade-off between diversity and stability: The more diverse a multi-task benchmark, the more sensitive to trivial changes it is. |
| Researcher Affiliation | Academia | 1Max Planck Institute for Intelligent Systems, Tübingen and Tübingen AI Center. Correspondence to: Guanhua Zhang <guanhua.zhang@tuebingen.mpg.de>. |
| Pseudocode | Yes | Algorithm 1 Sensitivity for Cardinal Benchmarks ... Algorithm 2 Sensitivity for Ordinal Benchmarks (a generic sketch of what such a sensitivity measurement looks like follows the table) |
| Open Source Code | Yes | The codes and data are available at https://socialfoundations.github.io/benchbench/. |
| Open Datasets | Yes | For our experiments, we collected seven widely-used benchmarks: GLUE (Wang et al., 2018), SuperGLUE (Wang et al., 2019), MTEB (Muennighoff et al., 2022), BIG-Bench Hard (Suzgun et al., 2022), MMLU (Hendrycks et al., 2020), Open LLM (Beeching et al., 2023; Gao et al., 2021) and VTAB (Zhai et al., 2019). ... The ImageNet benchmark is based on the validation set of the ILSVRC2012 challenge (Deng et al., 2009). ... Our selected benchmarks for experiments consist of BigCode (Ben Allal et al., 2022), three benchmarks from HELM (Liang et al., 2023), and seven benchmarks from HEIM (Lee et al., 2023). |
| Dataset Splits | No | The paper describes using various established benchmarks (e.g., GLUE, ImageNet validation set) for evaluation, but it does not specify any train/validation/test splits that were applied to these benchmarks for the purpose of their own experiments or analysis. The models evaluated are pre-existing. |
| Hardware Specification | No | The paper does not explicitly describe the specific hardware (e.g., GPU models, CPU types, or cloud compute instances) used to run the experiments. |
| Software Dependencies | No | The paper mentions 'TorchVision' but does not specify its version, nor does it list other software dependencies with their respective version numbers (e.g., Python, PyTorch, or specific libraries). |
| Experiment Setup | Yes | For the sensitivity calculation in each benchmark, we set the minimal preserving portion ϵ = min{0.01, std_min/std_max}... λ is set to 0.0 and the number of gradient-descent steps T is 1000. ... λ is set to 0.01 and the number of gradient-descent steps T is 100. (A sketch of the ϵ computation follows the table.) |
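The minimal preserving portion quoted in the Experiment Setup row is a simple clipped ratio of per-task score spreads. Below is a minimal sketch, assuming a models × tasks score matrix and NumPy; the function name and toy data are illustrative and not taken from the authors' released code.

```python
import numpy as np

# Minimal sketch (not the authors' implementation): derive the minimal
# preserving portion eps = min{0.01, std_min / std_max} quoted in the
# "Experiment Setup" row. Function name and toy data are hypothetical.
def minimal_preserving_portion(scores: np.ndarray) -> float:
    """scores: (num_models, num_tasks) array of benchmark scores."""
    task_stds = scores.std(axis=0)  # per-task standard deviation across models
    return min(0.01, task_stds.min() / task_stds.max())

rng = np.random.default_rng(0)
toy_scores = rng.uniform(0.0, 1.0, size=(20, 7))  # 20 models, 7 tasks
print(minimal_preserving_portion(toy_scores))
```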
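The paper's Algorithms 1 and 2 are not reproduced here; the following is only a generic illustration of the underlying idea, namely measuring how much a mean-aggregated model ranking moves under a change that arguably should not matter. The choice of perturbation (a monotone rescaling of one task) and all names are assumptions, not the paper's procedure.

```python
import numpy as np
from scipy.stats import kendalltau

# Generic illustration (not Algorithm 1 or 2 from the paper): compare the
# model ranking before and after a "trivial" monotone rescaling of one task.
# Kendall's tau close to 1 means the ranking is stable; lower values mean
# the aggregate ranking is sensitive to the change.
rng = np.random.default_rng(0)
scores = rng.uniform(0.0, 1.0, size=(20, 7))   # 20 models, 7 tasks (toy data)

perturbed = scores.copy()
perturbed[:, 0] = perturbed[:, 0] ** 3         # monotone transform of task 0

tau, _ = kendalltau(scores.mean(axis=1), perturbed.mean(axis=1))
print(f"Kendall tau between original and perturbed rankings: {tau:.3f}")
```

A tau below 1 here simply indicates that the mean-based ranking changed under a perturbation that preserves every within-task ordering, which is the kind of instability the paper's sensitivity measures quantify.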