What are the best Systems? New Perspectives on NLP Benchmarking

Authors: Pierre Colombo, Nathan Noiry, Ekhine Irurozki, Stephan Clémençon

NeurIPS 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct extensive numerical experiments (on over 270k scores) to assess the soundness of our approach both on synthetic and real scores (e.g. GLUE, XTREME, SEVAL, TAC, FLICKR).
Researcher Affiliation | Academia | Pierre Colombo, L2S, Centrale Supelec, France (pierre.colombo@centralesupelec.fr); Nathan Noiry, S2A, Telecom Paris, France (nathan.noiry@telecom-paris.fr); Ekhine Irurozki, S2A, Telecom Paris, France (ekhine.irurozki@telecom-paris.fr); Stephan Clémençon, S2A, Telecom Paris, France (stephan.clemencon@telecom-paris.fr)
Pseudocode | Yes | Borda's count. The Borda count consists, given a set of permutations η^1, ..., η^L ∈ S_N corresponding to the rankings of N systems across L ≥ 1 tasks or instances, in summing the ranks of each system and then ranking the obtained sums. Formally: 1. Compute sum_n := Σ_{l=1}^{L} η^l_n for every 1 ≤ n ≤ N; 2. Output η := Borda(η^1, ..., η^L) ∈ S_N that ranks the sums, i.e. η = argsort(argsort(sum_1, ..., sum_N)). (A runnable sketch of this aggregation is given after the table.)
Open Source Code | Yes | Our code and the collected data will be released to accelerate the adoption of what we think is a reliable evaluation method for multi-task and multi-criteria benchmarks. Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes] Both in the supplementary and main paper
Open Datasets | Yes | We collect the results of GLUE [91], SGLUE [92] and XTREME [50]. For GLUE, the dataset is composed of N = 105 systems that are evaluated on 9 different tasks... For SGLUE, the final dataset gathers scores from N = 24 systems that are evaluated on 10 different tasks... The XTREME benchmark is composed of N = 15 systems and includes tasks such as sentence classification... In this setting we focus on NLG evaluation... We focus on four different tasks: summary evaluation, image description, dialogue and translation. For summary evaluation, we use TAC08 [32], TAC10, TAC11 [69], RSUM [9] and SEVAL [41]. For sentence-based image description we rely on FLICKR [101], and for dialogue we use Persona Chat (PC) and Topical Chat (TC) [64]. Finally, for machine translation, we rely on the multilingual quality estimation (MLQE) dataset introduced in Ranasinghe et al. [81].
Dataset Splits | Yes | Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [Yes] In the supplementary.
Hardware Specification | Yes | Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [Yes] In the supplementary.
Software Dependencies | No | The paper does not provide specific version numbers for any software dependencies used in the experiments within the main text.
Experiment Setup | Yes | Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [Yes] In the supplementary.
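
The Borda aggregation quoted in the Pseudocode row is simple enough to sketch in code. The snippet below is a minimal illustration, not the authors' released implementation: it assumes rankings are stored as a NumPy array of shape (L, N) whose entry (l, n) is the rank of system n on task or instance l (lower is better), and the function name borda and the toy example are ours.

```python
import numpy as np

def borda(rankings: np.ndarray) -> np.ndarray:
    """Borda's count: sum the per-task ranks of each system, then rank the sums.

    rankings: array of shape (L, N); rankings[l, n] is the rank of system n
    under permutation eta^l (lower = better).
    Returns an array of length N giving the aggregated rank of each system.
    """
    # Step 1: sum_n = sum over l of eta^l_n
    sums = rankings.sum(axis=0)
    # Step 2: argsort(argsort(.)) converts the summed ranks into a ranking
    return np.argsort(np.argsort(sums))

# Toy example: L = 2 rankings of N = 3 systems (ranks 0, 1, 2).
eta = np.array([[0, 1, 2],
                [1, 2, 0]])
print(borda(eta))  # [0 2 1]: system 0 ranked first overall, system 1 last
```

In this sketch, ties between systems with equal rank sums are broken by index order via argsort, a detail the formal definition quoted above leaves open.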