Dynaboard: An Evaluation-As-A-Service Platform for Holistic Next-Generation Benchmarking

Authors: Zhiyi Ma, Kawin Ethayarajh, Tristan Thrush, Somya Jain, Ledell Wu, Robin Jia, Christopher Potts, Adina Williams, Douwe Kiela

NeurIPS 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We introduce Dynaboard, an evaluation-as-a-service framework for hosting benchmarks and conducting holistic model comparison, integrated with the Dynabench platform. Our platform evaluates NLP models directly instead of relying on self-reported metrics or predictions on a single dataset. ... The results are reported in Table 2.
Researcher Affiliation | Collaboration | Facebook AI; Stanford University
Pseudocode | No | The paper describes calculations such as the Marginal Rate of Substitution (MRS) and the Dynascore using mathematical formulas (e.g., Equation 2) but does not provide any structured pseudocode or algorithm blocks. (An illustrative sketch of such a computation is given after the table.)
Open Source Code | Yes | We refer the reviewer to the supplementary material for examples. ... (a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes]
Open Datasets | Yes | Table 1: Scoring and evaluation-as-a-service datasets used for each of the four Dynabench tasks. NLI: scoring on SNLI [5], MNLI [74] matched and mismatched, and ANLI rounds 1-3 [49]; evaluation-as-a-service on the respective dev sets, HANS [43], NLI stress tests [47], and Winogender [62] recast as NLI [53]. QA: scoring on SQuAD dev [57] and Adversarial QA [1] (Dynabench QA round 1); evaluation-as-a-service on Adversarial QA dev and the 12 dev sets from the MRQA shared tasks [18]. Sentiment: scoring on SST3 [66] and DynaSent [55]; evaluation-as-a-service on the respective dev sets, Amazon Reviews [78] test and dev (10k subsample), and Yelp Reviews [78] test and dev (10k subsample). Hate speech: scoring on Learning From The Worst [71]; evaluation-as-a-service on the respective dev sets and HateCheck [61]. ... We also compare to the majority baseline for the classification tasks (using the training set majority) and for QA a baseline that simply returns the entire context.
Dataset Splits | Yes | Table 1: Scoring and evaluation-as-a-service datasets used for each of the four Dynabench tasks. NLI: scoring on SNLI [5], MNLI [74] matched and mismatched, and ANLI rounds 1-3 [49]; evaluation-as-a-service on the respective dev sets, HANS [43], NLI stress tests [47], and Winogender [62] recast as NLI [53]. QA: scoring on SQuAD dev [57] and Adversarial QA [1] (Dynabench QA round 1); evaluation-as-a-service on Adversarial QA dev and the 12 dev sets from the MRQA shared tasks [18]. Sentiment: scoring on SST3 [66] and DynaSent [55]; evaluation-as-a-service on the respective dev sets, Amazon Reviews [78] test and dev (10k subsample), and Yelp Reviews [78] test and dev (10k subsample). Hate speech: scoring on Learning From The Worst [71]; evaluation-as-a-service on the respective dev sets and HateCheck [61].
Hardware Specification | Yes | For fair comparison, models are deployed on the exact same architecture. ... (d) Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [Yes]
Software Dependencies | No | The paper mentions several tools and libraries, such as the TextFlint evaluation toolkit [24], spaCy, AllenNLP, and Hugging Face’s Transformers, but it does not specify exact version numbers for these or other software dependencies.
Experiment Setup | No | The paper states "we finetune and evaluate BERT [12], RoBERTa [40], ALBERT [36], T5 [56] and DeBERTa [26] on all tasks", but the main text does not provide specific details on hyperparameters, training schedules, or other system-level settings used in the experiments.
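
Because the paper defines the MRS and Dynascore only through formulas and provides no pseudocode, the following Python sketch illustrates one way such an aggregation could be implemented. It is not the paper's Equation 2: the AMRS estimator (mean absolute change in the performance metric per unit change in each other metric, over models sorted by that metric), the multiplicative conversion into performance units, the function names, and the example weights are all assumptions made for illustration.

```python
import numpy as np


def estimate_amrs(scores: np.ndarray, perf_col: int = 0) -> np.ndarray:
    """Estimate an average marginal rate of substitution (AMRS) of each metric
    for the performance metric, from a (models x metrics) score matrix.

    Illustrative estimator: mean absolute change in performance per unit
    change in the metric, taken over models sorted by that metric.
    """
    n_metrics = scores.shape[1]
    amrs = np.ones(n_metrics)            # performance trades 1:1 with itself
    perf = scores[:, perf_col]
    for j in range(n_metrics):
        if j == perf_col:
            continue
        order = np.argsort(scores[:, j])  # sort models by metric j
        d_metric = np.diff(scores[order, j])
        d_perf = np.diff(perf[order])
        valid = d_metric != 0             # skip ties to avoid division by zero
        if valid.any():
            amrs[j] = np.mean(np.abs(d_perf[valid] / d_metric[valid]))
    return amrs


def dynascore(scores: np.ndarray, weights: np.ndarray, perf_col: int = 0) -> np.ndarray:
    """Weighted sum of metrics after converting each metric into units of the
    performance metric via its estimated AMRS (a sketch, not Equation 2)."""
    amrs = estimate_amrs(scores, perf_col)
    converted = scores * amrs             # metric units -> performance units
    return converted @ weights            # one score per model


# Toy usage: 3 models, metrics = [accuracy, robustness, fairness] (hypothetical)
scores = np.array([[0.85, 0.80, 0.90],
                   [0.88, 0.75, 0.92],
                   [0.80, 0.85, 0.88]])
weights = np.array([0.5, 0.25, 0.25])     # hypothetical metric weights
print(dynascore(scores, weights))
```

In the sketch, re-running `dynascore` with different `weights` mirrors the platform's idea of letting users adjust metric weights and re-rank models, while the AMRS keeps the heterogeneous metrics on a common scale.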