A Theory of Dynamic Benchmarks

Authors: Ali Shirali, Rediet Abebe, Moritz Hardt

ICLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We support our theoretical analysis by simulating dynamic benchmarks on two popular datasets."
Researcher Affiliation | Academia | Ali Shirali (University of California, Berkeley); Rediet Abebe (Harvard Society of Fellows); Moritz Hardt (Max Planck Institute for Intelligent Systems, Tübingen)
Pseudocode | Yes | Hierarchical dynamic benchmarking: for an underlying distribution D and true classifier f, given an initial distribution D_0 and an approximate risk minimizer A, depth-k, width-w hierarchical dynamic benchmarks are constructed recursively as A^(k)(D_0):
  1. h_0 = A^(k-1)(D_0)
  2. For t = 1, ..., w-1:
     (a) D~_{t-1} = D | h_{t-1}(x) ≠ f(x)
     (b) D_t = mix(D_0, D~_0, D~_1, ..., D~_{t-1})
     (c) h_t = A^(k-1)(D_t)
  3. Return maj(h_0, h_1, ..., h_{w-1})
  where A^(0) = A. (A runnable sketch of this recursion is given after the table.)
Open Source Code | No | The paper does not provide an explicit statement about the availability of open-source code for its methodology, nor does it include a direct link to a code repository.
Open Datasets | Yes | "We study this question by simulating path dynamic benchmarks on two popular static benchmarks, CIFAR-10 (Krizhevsky et al., 2009) and SNLI (Bowman et al., 2015)."
Dataset Splits | No | The paper mentions using 'training data' and 'test data' for models within its simulations (e.g., 'base classifier achieves 73% accuracy after 30 epochs on the training data' and 'base model achieved an accuracy of 68% on the training data and 61% on the test data'), but it does not explicitly provide the percentages or sample counts of the training, validation, and test splits of CIFAR-10 or SNLI used in its own experiments.
Hardware Specification | No | The paper describes the model architectures used (a CNN for CIFAR-10 and a Bi-LSTM for SNLI) but does not provide any details about the hardware used to run the experiments, such as GPU or CPU models.
Software Dependencies | No | The paper mentions the use of 'pre-trained 100d GloVe vectors' but does not specify any software dependencies (e.g., libraries, frameworks, or languages) with version numbers that would be required to reproduce the experiments.
Experiment Setup | Yes | "The base classifier achieves 73% accuracy after 30 epochs on the training data, reasonably above the chance level of about 10%." and "Draw multiple rollouts of path dynamic benchmarks. A rollout from path dynamic benchmark is a sequence of distributions and models obtained by alternatingly training a new classifier on the weighted extracted dataset and up-weighting (down-weighting) the distribution over misclassified (correctly classified) samples according to a mixture rule with uniform weights." (A sketch of one such rollout follows the hierarchical sketch below.)
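
The recursion in the Pseudocode row can be made concrete. The sketch below is a minimal, hedged illustration, assuming that distributions are weight vectors over a fixed labeled pool (X, y), that the true classifier f is given by the labels, and that the approximate risk minimizer A is a weighted logistic regression; the helpers condition_on_errors, mix, majority_vote, and the toy data are illustrative choices, not the paper's implementation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def condition_on_errors(X, y, h):
    # D | h(x) != f(x): restrict the underlying (here uniform) distribution
    # to the points that model h misclassifies, then renormalize.
    wrong = (h(X) != y).astype(float)
    total = wrong.sum()
    # If h is already correct everywhere, fall back to the uniform distribution.
    return wrong / total if total > 0 else np.ones(len(y)) / len(y)

def mix(*dists):
    # Uniform mixture of weight vectors defined over the same pool.
    return sum(dists) / len(dists)

def majority_vote(models):
    # maj(h_0, ..., h_{w-1}): pointwise majority over integer class predictions.
    def maj(X):
        preds = np.stack([h(X) for h in models]).astype(int)
        return np.apply_along_axis(lambda c: np.bincount(c).argmax(), 0, preds)
    return maj

def base_learner(X, y, weights):
    # Stand-in approximate risk minimizer A: weighted logistic regression.
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X, y, sample_weight=weights * len(y))
    return clf.predict

def hierarchical_benchmark(A, X, y, D0, depth, width):
    # A^(k)(D_0): recursively train `width` models and majority-vote them.
    if depth == 0:
        return A(X, y, D0)                                        # A^(0) = A
    h = [hierarchical_benchmark(A, X, y, D0, depth - 1, width)]   # h_0
    conditioned = []                                              # D~_0, D~_1, ...
    for t in range(1, width):
        conditioned.append(condition_on_errors(X, y, h[t - 1]))   # D~_{t-1}
        Dt = mix(D0, *conditioned)                                # D_t
        h.append(hierarchical_benchmark(A, X, y, Dt, depth - 1, width))  # h_t
    return majority_vote(h)

# Toy usage on synthetic data (a stand-in for CIFAR-10 / SNLI features).
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 5))
y = (X[:, 0] * X[:, 1] > 0).astype(int)     # a task a single linear model finds hard
D0 = np.ones(len(y)) / len(y)
model = hierarchical_benchmark(base_learner, X, y, D0, depth=2, width=3)
print("majority-vote training accuracy:", (model(X) == y).mean())
```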
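
The rollout procedure quoted in the Experiment Setup row can be sketched in the same setting. The uniform-weight mixture rule is read here as mixing the initial distribution with every error-conditioned distribution produced so far, and a weighted logistic regression again stands in for the paper's CNN and Bi-LSTM models, so this is one interpretation of the quoted procedure rather than a reproduction of the experiments.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def path_rollout(X, y, n_rounds):
    # One rollout of a path dynamic benchmark over a fixed labeled pool:
    # alternate between (i) training a classifier on the currently weighted
    # dataset and (ii) reweighting, where misclassified points are up-weighted
    # and correctly classified points down-weighted by conditioning on the
    # error event and mixing with uniform weights (assumed reading of the rule).
    n = len(y)
    D0 = np.ones(n) / n                       # initial distribution: uniform
    error_dists, models, current = [], [], D0
    for t in range(n_rounds):
        h = LogisticRegression(max_iter=1000)
        h.fit(X, y, sample_weight=current * n)        # train on the weighted pool
        models.append(h)
        wrong = (h.predict(X) != y).astype(float)
        if wrong.sum() == 0:
            break                             # nothing misclassified; stop early
        error_dists.append(wrong / wrong.sum())       # D | h_t(x) != f(x)
        # Uniform mixture of D_0 and every error distribution seen so far.
        current = (D0 + sum(error_dists)) / (1 + len(error_dists))
    return models, current

# Toy rollout on synthetic data (a stand-in for CIFAR-10 / SNLI features).
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 10))
y = (X[:, 0] + 0.3 * rng.normal(size=500) > 0).astype(int)
models, final_weights = path_rollout(X, y, n_rounds=5)
print(f"trained {len(models)} models; final weights sum to {final_weights.sum():.3f}")
```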