Dynaboard: An Evaluation-As-A-Service Platform for Holistic Next-Generation Benchmarking

Authors: Zhiyi Ma, Kawin Ethayarajh, Tristan Thrush, Somya Jain, Ledell Wu, Robin Jia, Christopher Potts, Adina Williams, Douwe Kiela

NeurIPS 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We introduce Dynaboard, an evaluation-as-a-service framework for hosting benchmarks and conducting holistic model comparison, integrated with the Dynabench platform. Our platform evaluates NLP models directly instead of relying on self-reported metrics or predictions on a single dataset. ... The results are reported in Table 2.
Researcher Affiliation | Collaboration | Facebook AI; Stanford University
Pseudocode | No | The paper describes calculations such as the Marginal Rate of Substitution (MRS) and the Dynascore using mathematical formulas (e.g., Equation 2) but does not provide any structured pseudocode or algorithm blocks. (An illustrative sketch of such a computation is given after the table.)
Open Source Code | Yes | We refer the reviewer to the supplementary material for examples. ... (a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes]
Open Datasets | Yes | Table 1: Scoring and evaluation-as-a-service datasets used for each of the four Dynabench tasks. NLI: scoring on SNLI [5], MNLI [74] matched and mismatched, and ANLI rounds 1-3 [49]; evaluation-as-a-service on the respective dev sets, HANS [43], NLI stress tests [47], and Winogender [62] recast as NLI [53]. QA: scoring on SQuAD dev [57] and Adversarial QA [1] (Dynabench QA round 1); evaluation-as-a-service on Adversarial QA dev and the 12 dev sets from the MRQA shared tasks [18]. Sentiment: scoring on SST3 [66] and DynaSent [55]; evaluation-as-a-service on the respective dev sets, Amazon Reviews [78] test and dev (10k subsample), and Yelp Reviews [78] test and dev (10k subsample). Hate speech: scoring on Learning From The Worst [71]; evaluation-as-a-service on the respective dev sets and HateCheck [61]. ... We also compare to the majority baseline for the classification tasks (using the training set majority) and for QA a baseline that simply returns the entire context.
Dataset Splits | Yes | Table 1: Scoring and evaluation-as-a-service datasets used for each of the four Dynabench tasks. NLI: scoring on SNLI [5], MNLI [74] matched and mismatched, and ANLI rounds 1-3 [49]; evaluation-as-a-service on the respective dev sets, HANS [43], NLI stress tests [47], and Winogender [62] recast as NLI [53]. QA: scoring on SQuAD dev [57] and Adversarial QA [1] (Dynabench QA round 1); evaluation-as-a-service on Adversarial QA dev and the 12 dev sets from the MRQA shared tasks [18]. Sentiment: scoring on SST3 [66] and DynaSent [55]; evaluation-as-a-service on the respective dev sets, Amazon Reviews [78] test and dev (10k subsample), and Yelp Reviews [78] test and dev (10k subsample). Hate speech: scoring on Learning From The Worst [71]; evaluation-as-a-service on the respective dev sets and HateCheck [61].
Hardware Specification | Yes | For fair comparison, models are deployed on the exact same architecture. ... (d) Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [Yes]
Software Dependencies | No | The paper mentions several tools and libraries, such as the TextFlint evaluation toolkit [24], spaCy, AllenNLP, and Hugging Face’s Transformers, but it does not specify exact version numbers for these or other software dependencies.
Experiment Setup | No | The paper states "we finetune and evaluate BERT [12], RoBERTa [40], ALBERT [36], T5 [56] and DeBERTa [26] on all tasks", but the main text does not provide specific details on hyperparameters, training schedules, or other system-level settings used in the experiments.
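
Because the paper defines the MRS and Dynascore only through formulas and provides no pseudocode, the following Python sketch illustrates one way such an aggregation could be implemented. It is not the paper's Equation 2: the AMRS estimator (mean absolute change in the performance metric per unit change in each other metric, over models sorted by that metric), the multiplicative conversion into performance units, the function names, and the example weights are all assumptions made for illustration.

```python
import numpy as np


def estimate_amrs(scores: np.ndarray, perf_col: int = 0) -> np.ndarray:
    """Estimate an average marginal rate of substitution (AMRS) of each metric
    for the performance metric, from a (models x metrics) score matrix.

    Illustrative estimator: mean absolute change in performance per unit
    change in the metric, taken over models sorted by that metric.
    """
    n_metrics = scores.shape[1]
    amrs = np.ones(n_metrics)            # performance trades 1:1 with itself
    perf = scores[:, perf_col]
    for j in range(n_metrics):
        if j == perf_col:
            continue
        order = np.argsort(scores[:, j])  # sort models by metric j
        d_metric = np.diff(scores[order, j])
        d_perf = np.diff(perf[order])
        valid = d_metric != 0             # skip ties to avoid division by zero
        if valid.any():
            amrs[j] = np.mean(np.abs(d_perf[valid] / d_metric[valid]))
    return amrs


def dynascore(scores: np.ndarray, weights: np.ndarray, perf_col: int = 0) -> np.ndarray:
    """Weighted sum of metrics after converting each metric into units of the
    performance metric via its estimated AMRS (a sketch, not Equation 2)."""
    amrs = estimate_amrs(scores, perf_col)
    converted = scores * amrs             # metric units -> performance units
    return converted @ weights            # one score per model


# Toy usage: 3 models, metrics = [accuracy, robustness, fairness] (hypothetical)
scores = np.array([[0.85, 0.80, 0.90],
                   [0.88, 0.75, 0.92],
                   [0.80, 0.85, 0.88]])
weights = np.array([0.5, 0.25, 0.25])     # hypothetical metric weights
print(dynascore(scores, weights))
```

In the sketch, re-running `dynascore` with different `weights` mirrors the platform's idea of letting users adjust metric weights and re-rank models, while the AMRS keeps the heterogeneous metrics on a common scale.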