A Theory of Dynamic Benchmarks
Authors: Ali Shirali, Rediet Abebe, Moritz Hardt
ICLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We support our theoretical analysis by simulating dynamic benchmarks on two popular datasets. |
| Researcher Affiliation | Academia | Ali Shirali: University of California, Berkeley; Rediet Abebe: Harvard Society of Fellows; Moritz Hardt: Max Planck Institute for Intelligent Systems, Tübingen |
| Pseudocode | Yes | Hierarchical dynamic benchmarking: for an underlying distribution D with true classifier f, given an initial distribution D_0 and an approximate risk minimizer A, a depth-k, width-w hierarchical dynamic benchmark is constructed recursively as def A^(k)(D_0): 1. h_0 = A^(k−1)(D_0); 2. for t = 1, …, w−1: (a) D′_{t−1} = D \| h_{t−1}(x) ≠ f(x), (b) D_t = mix(D_0, D′_0, D′_1, …, D′_{t−1}), (c) h_t = A^(k−1)(D_t); 3. return maj(h_0, h_1, …, h_{w−1}), where A^(0) = A. |
| Open Source Code | No | The paper does not provide an explicit statement about the availability of open-source code for its methodology, nor does it include a direct link to a code repository. |
| Open Datasets | Yes | We study this question by simulating path dynamic benchmarks on two popular static benchmarks, CIFAR-10 (Krizhevsky et al., 2009) and SNLI (Bowman et al., 2015). |
| Dataset Splits | No | The paper mentions using 'training data' and 'test data' for models within its simulations (e.g., 'base classifier achieves 73% accuracy after 30 epochs on the training data' and 'base model achieved an accuracy of 68% on the training data and 61% on the test data'), but it does not explicitly provide the specific percentages or sample counts for the training, validation, and test splits of the CIFAR-10 or SNLI datasets for its own experiments. |
| Hardware Specification | No | The paper describes the model architectures used (CNN for CIFAR-10, Bi-LSTM for SNLI) but does not provide any specific details about the hardware used to run the experiments, such as GPU or CPU models. |
| Software Dependencies | No | The paper mentions the use of 'pre-trained 100d GloVe vectors' but does not specify any software dependencies (e.g., libraries, frameworks, or languages) with their corresponding version numbers that would be required to reproduce the experiments. |
| Experiment Setup | Yes | "The base classifier achieves 73% accuracy after 30 epochs on the training data, reasonably above the chance level of about 10%." and "Draw multiple rollouts of path dynamic benchmarks. A rollout from a path dynamic benchmark is a sequence of distributions and models obtained by alternately training a new classifier on the weighted extracted dataset and up-weighting (down-weighting) the distribution over misclassified (correctly classified) samples according to a mixture rule with uniform weights." |
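The recursive construction in the Pseudocode row can be sketched in a few lines of Python. This is a minimal toy illustration, not the paper's implementation: the empirical "distributions" (lists of points), the weak learner `A` (which memorizes a random 70% of the points it sees), and the helper names `mix`, `maj`, and `A_k` are all assumptions introduced here for demonstration.

```python
import random

random.seed(0)

# Toy setup (illustrative only): a "distribution" is an empirical list of
# points, f is the assumed ground-truth classifier, and A is a weak
# approximate risk minimizer that memorizes a random 70% of its input.
POINTS = list(range(200))
f = lambda x: x % 3 == 0

def A(dist):
    """Weak learner: correct on a random 70% sample of dist, wrong elsewhere."""
    known = {x: f(x) for x in random.sample(dist, int(0.7 * len(dist)))}
    return lambda x: known.get(x, not f(x))

def mix(*dists):
    """Uniform mixture of empirical distributions (simple concatenation)."""
    return [x for d in dists for x in d]

def maj(models):
    """Majority vote over a list of models."""
    return lambda x: sum(h(x) for h in models) > len(models) / 2

def A_k(dist, k, w):
    """Depth-k, width-w hierarchical dynamic benchmark, per the pseudocode."""
    if k == 0:
        return A(dist)
    models = [A_k(dist, k - 1, w)]           # h_0 = A^(k-1)(D_0)
    hard_sets = []                           # D'_t: points model t gets wrong
    for t in range(1, w):
        # D'_{t-1} = D | h_{t-1}(x) != f(x); fall back to dist if empty
        hard = [x for x in POINTS if models[-1](x) != f(x)] or dist
        hard_sets.append(hard)
        # D_t = mix(D_0, D'_0, ..., D'_{t-1});  h_t = A^(k-1)(D_t)
        models.append(A_k(mix(dist, *hard_sets), k - 1, w))
    return maj(models)                       # maj(h_0, ..., h_{w-1})

err = lambda h: sum(h(x) != f(x) for x in POINTS) / len(POINTS)
base, boosted = err(A(POINTS)), err(A_k(POINTS, k=1, w=3))
print(f"base error {base:.2f} vs depth-1 width-3 error {boosted:.2f}")
```

Because each round re-mixes in the previously misclassified points, the mixture over-represents hard examples, so the majority vote of the width-w models typically errs less than the base learner alone; this mirrors the boosting-style amplification the paper analyzes.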