What are the best Systems? New Perspectives on NLP Benchmarking

Authors: Pierre Colombo, Nathan Noiry, Ekhine Irurozki, Stephan Clémençon

NeurIPS 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct extensive numerical experiments (on over 270k scores) to assess the soundness of our approach both on synthetic and real scores (e.g. GLUE, XTREME, SEVAL, TAC, FLICKR).
Researcher Affiliation | Academia | Pierre Colombo, L2S, Centrale Supelec, France (pierre.colombo@centralesupelec.fr); Nathan Noiry, S2A, Telecom Paris, France (nathan.noiry@telecom-paris.fr); Ekhine Irurozki, S2A, Telecom Paris, France (ekhine.irurozki@telecom-paris.fr); Stephan Clémençon, S2A, Telecom Paris, France (stephan.clemencon@telecom-paris.fr)
Pseudocode | Yes | Borda's count. The Borda count consists, given a set of permutations η^1, ..., η^L ∈ S_N corresponding to the rankings of N systems across L ≥ 1 tasks or instances, in summing the ranks of each system and then ranking the obtained sums. Formally: 1. Compute sum_n := Σ_{l=1}^{L} η^l_n for every 1 ≤ n ≤ N; 2. Output η := Borda(η^1, ..., η^L) ∈ S_N that ranks the sums, i.e. η = argsort(argsort(sum_1, ..., sum_N)). (A runnable sketch of this aggregation is given after the table.)
Open Source Code | Yes | Our code and the collected data will be released to accelerate the adoption of what we think is a reliable evaluation method for multi-task and multi-criteria benchmarks. Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes] Both in the supplementary and main paper
Open Datasets | Yes | We collect the results of GLUE [91], SGLUE [92] and XTREME [50]. For GLUE, the dataset is composed of N = 105 systems that are evaluated on 9 different tasks... For SGLUE, the final dataset gathers scores from N = 24 systems that are evaluated on 10 different tasks... The XTREME benchmark is composed of N = 15 systems and includes tasks such as sentence classification... In this setting we focus on NLG evaluation... We focus on four different tasks: summary evaluation, image description, dialogue and translation. For summary evaluation, we use TAC08 [32], TAC10, TAC11 [69], RSUM [9] and SEVAL [41]. For sentence-based image description we rely on FLICKR [101], and for dialogue we use Persona Chat (PC) and Topical Chat (TC) [64]. Finally, for machine translation, we rely on the multilingual quality estimation (MLQE) dataset introduced in Ranasinghe et al. [81].
Dataset Splits | Yes | Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [Yes] In the supplementary.
Hardware Specification | Yes | Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [Yes] In the supplementary.
Software Dependencies | No | The paper does not provide specific version numbers for any software dependencies used in the experiments within the main text.
Experiment Setup | Yes | Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [Yes] In the supplementary.
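
The Borda aggregation quoted in the Pseudocode row is simple enough to sketch in code. The snippet below is a minimal illustration, not the authors' released implementation: it assumes rankings are stored as a NumPy array of shape (L, N) whose entry (l, n) is the rank of system n on task or instance l (lower is better), and the function name borda and the toy example are ours.

```python
import numpy as np

def borda(rankings: np.ndarray) -> np.ndarray:
    """Borda's count: sum the per-task ranks of each system, then rank the sums.

    rankings: array of shape (L, N); rankings[l, n] is the rank of system n
    under permutation eta^l (lower = better).
    Returns an array of length N giving the aggregated rank of each system.
    """
    # Step 1: sum_n = sum over l of eta^l_n
    sums = rankings.sum(axis=0)
    # Step 2: argsort(argsort(.)) converts the summed ranks into a ranking
    return np.argsort(np.argsort(sums))

# Toy example: L = 2 rankings of N = 3 systems (ranks 0, 1, 2).
eta = np.array([[0, 1, 2],
                [1, 2, 0]])
print(borda(eta))  # [0 2 1]: system 0 ranked first overall, system 1 last
```

In this sketch, ties between systems with equal rank sums are broken by index order via argsort, a detail the formal definition quoted above leaves open.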