What are the best Systems? New Perspectives on NLP Benchmarking
Authors: Pierre Colombo, Nathan Noiry, Ekhine Irurozki, Stephan Clémençon
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct extensive numerical experiments (on over 270k scores) to assess the soundness of our approach both on synthetic and real scores (e.g. GLUE, EXTREM, SEVAL, TAC, FLICKR). |
| Researcher Affiliation | Academia | Pierre Colombo, L2S, Centrale Supelec, France, pierre.colombo@centralesupelec.fr; Nathan Noiry, S2A, Telecom Paris, France, nathan.noiry@telecom-paris.fr; Ekhine Irurozki, S2A, Telecom Paris, France, ekhine.irurozki@telecom-paris.fr; Stephan Clémençon, S2A, Telecom Paris, France, stephan.clemencon@telecom-paris.fr |
| Pseudocode | Yes | Borda's count. Borda's count consists, from a set of permutations $\eta_1, \dots, \eta_L \in \mathfrak{S}_N$ corresponding to the rankings of $N$ systems across $L \ge 1$ tasks or instances, in summing the ranks of each system and then ranking the obtained sums. Formally: (1) compute $\mathrm{sum}_n := \sum_{l=1}^{L} \eta_l(n)$ for every $1 \le n \le N$; (2) output $\eta := \mathrm{Borda}(\eta_1, \dots, \eta_L) \in \mathfrak{S}_N$ that ranks the sums, i.e. $\eta = \mathrm{argsort}(\mathrm{argsort}(\mathrm{sum}_1, \dots, \mathrm{sum}_N))$. (A runnable sketch of this aggregation appears after the table.) |
| Open Source Code | Yes | Our code and the collected data will be released to accelerate the adoption of what we think is a reliable evaluation method for multi-tasks and multi-criteria benchmarks. From the checklist: Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes] Both in the supplementary and main paper |
| Open Datasets | Yes | We collect the results of GLUE [91], SGLUE [92]1 and XTREME [50]. For GLUE the dataset is composed of N = 105 systems that are evaluated on 9 different tasks... For SGLUE, the final dataset gathers scores from N = 24 systems that are evaluated on 10 different tasks... The XTREM benchmark is composed of N = 15 systems and includes tasks such as sentence classification... In this setting we focus on NLG evaluation... We focus on four different tasks: summary evaluation, image description, dialogue and translation. For summary evaluation, we use TAC08 [32], TAC10, TAC11 [69], RSUM [9] and SEVAL [41]. For sentence-based image description we rely on FLICKR [101] and for dialogue we use Persona Chat (PC) and Topical Chat (TC) [64]. Finally for machine translation, we rely on the multilingual quality estimation (MLQE) introduced in Ranasinghe et al. [81]. |
| Dataset Splits | Yes | Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [Yes] In the supplementary. |
| Hardware Specification | Yes | Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [Yes] In the supplementary. |
| Software Dependencies | No | The paper does not provide specific version numbers for any software dependencies used in the experiments within the main text. |
| Experiment Setup | Yes | Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [Yes] In the supplementary. |
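
The Borda aggregation quoted in the Pseudocode row reduces to a per-system rank sum followed by a double argsort. Below is a minimal NumPy sketch of that procedure, assuming 0-indexed ranks and an `(L, N)` array of per-task rankings; the function name `borda_count` is ours for illustration and is not taken from the authors' released code.

```python
import numpy as np

def borda_count(rankings):
    """Aggregate L rankings of N systems with Borda's count.

    `rankings` is an (L, N) array where rankings[l, n] is the rank of
    system n on task/instance l (0 = best). Returns the aggregated
    ranking of the N systems, obtained by ranking the per-system rank
    sums via the double-argsort trick described in the paper.
    """
    rankings = np.asarray(rankings)
    sums = rankings.sum(axis=0)          # sum_n = sum over l of eta_l(n)
    return np.argsort(np.argsort(sums))  # rank of each system's rank sum

# Toy usage: 3 tasks ranking 4 systems (rank 0 = best).
if __name__ == "__main__":
    etas = [[0, 1, 2, 3],
            [1, 0, 2, 3],
            [0, 2, 1, 3]]
    print(borda_count(etas))  # -> [0 1 2 3]
```

Using `argsort(argsort(.))` avoids an explicit sort-then-lookup loop: the inner argsort orders the systems by their rank sums, and the outer argsort inverts that permutation to recover each system's final rank.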