A Theory of Dynamic Benchmarks
Authors: Ali Shirali, Rediet Abebe, Moritz Hardt
ICLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We support our theoretical analysis by simulating dynamic benchmarks on two popular datasets. |
| Researcher Affiliation | Academia | Ali Shirali: University of California, Berkeley; Rediet Abebe: Harvard Society of Fellows; Moritz Hardt: Max Planck Institute for Intelligent Systems, Tübingen |
| Pseudocode | Yes | Hierarchical dynamic benchmarking: for an underlying distribution D with true classifier f, given an initial distribution D_0 and an approximate risk minimizer A, a depth-k, width-w hierarchical dynamic benchmark is constructed recursively as def A^(k)(D_0): 1. h_0 = A^(k−1)(D_0); 2. for t = 1, …, w−1: (a) D′_{t−1} = D \| h_{t−1}(x) ≠ f(x), (b) D_t = mix(D_0, D′_0, D′_1, …, D′_{t−1}), (c) h_t = A^(k−1)(D_t); 3. return maj(h_0, h_1, …, h_{w−1}), where A^(0) = A. |
| Open Source Code | No | The paper does not provide an explicit statement about the availability of open-source code for its methodology, nor does it include a direct link to a code repository. |
| Open Datasets | Yes | We study this question by simulating path dynamic benchmarks on two popular static benchmarks, CIFAR-10 (Krizhevsky et al., 2009) and SNLI (Bowman et al., 2015). |
| Dataset Splits | No | The paper mentions using 'training data' and 'test data' for models within its simulations (e.g., 'base classifier achieves 73% accuracy after 30 epochs on the training data' and 'base model achieved an accuracy of 68% on the training data and 61% on the test data'), but it does not explicitly provide the specific percentages or sample counts for the training, validation, and test splits of the CIFAR-10 or SNLI datasets for its own experiments. |
| Hardware Specification | No | The paper describes the model architectures used (CNN for CIFAR-10, Bi-LSTM for SNLI) but does not provide any specific details about the hardware used to run the experiments, such as GPU or CPU models. |
| Software Dependencies | No | The paper mentions the use of 'pre-trained 100d GloVe vectors' but does not specify any software dependencies (e.g., libraries, frameworks, or languages) with their corresponding version numbers that would be required to reproduce the experiments. |
| Experiment Setup | Yes | "The base classifier achieves 73% accuracy after 30 epochs on the training data, reasonably above the chance level of about 10%." and "Draw multiple rollouts of path dynamic benchmarks. A rollout from a path dynamic benchmark is a sequence of distributions and models obtained by alternately training a new classifier on the weighted extracted dataset and up-weighting (down-weighting) the distribution over misclassified (correctly classified) samples according to a mixture rule with uniform weights." |
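The recursive construction in the Pseudocode row can be sketched in a few lines of Python. This is a minimal toy illustration, not the paper's implementation: the empirical "distributions" (lists of points), the weak learner `A` (which memorizes a random 70% of the points it sees), and the helper names `mix`, `maj`, and `A_k` are all assumptions introduced here for demonstration.

```python
import random

random.seed(0)

# Toy setup (illustrative only): a "distribution" is an empirical list of
# points, f is the assumed ground-truth classifier, and A is a weak
# approximate risk minimizer that memorizes a random 70% of its input.
POINTS = list(range(200))
f = lambda x: x % 3 == 0

def A(dist):
    """Weak learner: correct on a random 70% sample of dist, wrong elsewhere."""
    known = {x: f(x) for x in random.sample(dist, int(0.7 * len(dist)))}
    return lambda x: known.get(x, not f(x))

def mix(*dists):
    """Uniform mixture of empirical distributions (simple concatenation)."""
    return [x for d in dists for x in d]

def maj(models):
    """Majority vote over a list of models."""
    return lambda x: sum(h(x) for h in models) > len(models) / 2

def A_k(dist, k, w):
    """Depth-k, width-w hierarchical dynamic benchmark, per the pseudocode."""
    if k == 0:
        return A(dist)
    models = [A_k(dist, k - 1, w)]           # h_0 = A^(k-1)(D_0)
    hard_sets = []                           # D'_t: points model t gets wrong
    for t in range(1, w):
        # D'_{t-1} = D | h_{t-1}(x) != f(x); fall back to dist if empty
        hard = [x for x in POINTS if models[-1](x) != f(x)] or dist
        hard_sets.append(hard)
        # D_t = mix(D_0, D'_0, ..., D'_{t-1});  h_t = A^(k-1)(D_t)
        models.append(A_k(mix(dist, *hard_sets), k - 1, w))
    return maj(models)                       # maj(h_0, ..., h_{w-1})

err = lambda h: sum(h(x) != f(x) for x in POINTS) / len(POINTS)
base, boosted = err(A(POINTS)), err(A_k(POINTS, k=1, w=3))
print(f"base error {base:.2f} vs depth-1 width-3 error {boosted:.2f}")
```

Because each round re-mixes in the previously misclassified points, the mixture over-represents hard examples, so the majority vote of the width-w models typically errs less than the base learner alone; this mirrors the boosting-style amplification the paper analyzes.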