Dynaboard: An Evaluation-As-A-Service Platform for Holistic Next-Generation Benchmarking
Authors: Zhiyi Ma, Kawin Ethayarajh, Tristan Thrush, Somya Jain, Ledell Wu, Robin Jia, Christopher Potts, Adina Williams, Douwe Kiela
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We introduce Dynaboard, an evaluation-as-a-service framework for hosting benchmarks and conducting holistic model comparison, integrated with the Dynabench platform. Our platform evaluates NLP models directly instead of relying on self-reported metrics or predictions on a single dataset. ... The results are reported in Table 2. |
| Researcher Affiliation | Collaboration | Facebook AI; Stanford University |
| Pseudocode | No | The paper describes calculations like the Marginal Rate of Substitution (MRS) and Dynascore using mathematical formulas (e.g., Equation 2) but does not provide any structured pseudocode or algorithm blocks. An illustrative sketch of this kind of aggregation is given below the table. |
| Open Source Code | Yes | We refer the reviewer to the supplementary material for examples. ... (a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes] |
| Open Datasets | Yes | Table 1: Scoring and evaluation-as-a-service datasets used for each of the four Dynabench tasks. NLI (scoring: SNLI [5], MNLI [74] matched and mismatched, ANLI rounds 1-3 [49]; evaluation-as-a-service: respective dev sets plus HANS [43], NLI stress tests [47], Winogender [62] recast as NLI [53]). QA (scoring: SQuAD dev [57], Adversarial QA [1] (Dynabench QA round 1); evaluation-as-a-service: Adversarial QA dev plus the 12 dev sets from the MRQA shared tasks [18]). Sentiment (scoring: SST3 [66], DynaSent [55]; evaluation-as-a-service: respective dev sets plus Amazon Reviews [78] test and dev (10k subsample), Yelp Reviews [78] test and dev (10k subsample)). Hate speech (scoring: Learning From The Worst [71]; evaluation-as-a-service: respective dev sets plus HateCheck [61]). ... We also compare to the majority baseline for the classification tasks (using the training set majority) and, for QA, a baseline that simply returns the entire context. |
| Dataset Splits | Yes | Table 1: Scoring and evaluation-as-a-service datasets used for each of the four Dynabench tasks. NLI (scoring: SNLI [5], MNLI [74] matched and mismatched, ANLI rounds 1-3 [49]; evaluation-as-a-service: respective dev sets plus HANS [43], NLI stress tests [47], Winogender [62] recast as NLI [53]). QA (scoring: SQuAD dev [57], Adversarial QA [1] (Dynabench QA round 1); evaluation-as-a-service: Adversarial QA dev plus the 12 dev sets from the MRQA shared tasks [18]). Sentiment (scoring: SST3 [66], DynaSent [55]; evaluation-as-a-service: respective dev sets plus Amazon Reviews [78] test and dev (10k subsample), Yelp Reviews [78] test and dev (10k subsample)). Hate speech (scoring: Learning From The Worst [71]; evaluation-as-a-service: respective dev sets plus HateCheck [61]). |
| Hardware Specification | Yes | For fair comparison, models are deployed on the exact same architecture. ... (d) Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [Yes] |
| Software Dependencies | No | The paper mentions several tools and libraries such as 'TextFlint evaluation toolkit [24]', 'spaCy', 'AllenNLP', and 'Hugging Face’s Transformers', but it does not specify exact version numbers for these or other software dependencies. |
| Experiment Setup | No | The paper states 'we finetune and evaluate BERT [12], RoBERTa [40], ALBERT [36], T5 [56] and DeBERTa [26] on all tasks', but it does not provide specific details on hyperparameters, training schedules, or other system-level settings used in the experiments within the main text. |
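The Pseudocode row above notes that the paper defines the Dynascore through the Marginal Rate of Substitution rather than with pseudocode. The sketch below is a minimal, hedged illustration of that style of aggregation only: it assumes an average MRS estimated from pairwise score differences across models and a weighted sum of metrics converted into units of a default metric. The function names, the pairwise estimator, and the toy scores are illustrative assumptions, not the paper's exact Equation 2.

```python
import numpy as np

def average_mrs(scores, metric, default="performance", eps=1e-8):
    """Estimate an average marginal rate of substitution (AMRS) for `metric`
    relative to the default metric, as the mean over model pairs of
    |delta metric / delta default|. This pairwise estimator is an
    illustrative assumption, not the paper's exact formulation."""
    models = list(scores)
    ratios = []
    for i, a in enumerate(models):
        for b in models[i + 1:]:
            d_metric = scores[a][metric] - scores[b][metric]
            d_default = scores[a][default] - scores[b][default]
            if abs(d_default) > eps:
                ratios.append(abs(d_metric / d_default))
    amrs = float(np.mean(ratios)) if ratios else 1.0
    return amrs if amrs > eps else 1.0

def dynascore_like(scores, weights, default="performance"):
    """Weighted aggregate in the spirit of the Dynascore: each metric is
    divided by its AMRS (expressing it in units of the default metric)
    and combined with user-chosen weights."""
    results = {}
    for model, metrics in scores.items():
        total = 0.0
        for metric, w in weights.items():
            amrs = 1.0 if metric == default else average_mrs(scores, metric, default)
            total += w * metrics[metric] / amrs
        results[model] = total
    return results

# Toy example with hypothetical numbers (not taken from the paper).
scores = {
    "model_a": {"performance": 0.82, "throughput": 120.0},
    "model_b": {"performance": 0.88, "throughput": 45.0},
}
weights = {"performance": 0.5, "throughput": 0.5}
print(dynascore_like(scores, weights))
```

In the paper, the weights are adjustable by the leaderboard user and the default metric is task performance; the sketch above only mirrors that structure on made-up numbers.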