Elo Uncovered: Robustness and Best Practices in Language Model Evaluation
Authors: Meriem Boubdir, Edward Kim, Beyza Ermis, Sara Hooker, Marzieh Fadaee
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct an extensive evaluation of Elo behavior across simulated and real-world scenarios, demonstrating that individual Elo computations can exhibit significant volatility. We show that the two axioms underlying Elo ratings, reliability and transitivity, are not always satisfied, raising questions about the reliability of current comparative evaluations of LLMs. |
| Researcher Affiliation | Industry | Meriem Boubdir (Cohere For AI, meri.boubdir@gmail.com); Edward Kim (Cohere, edward@cohere.com); Beyza Ermis (Cohere For AI, beyza@cohere.com); Sara Hooker (Cohere For AI, sarahooker@cohere.com); Marzieh Fadaee (Cohere For AI, marzieh@cohere.com) |
| Pseudocode | No | The paper provides mathematical formulas for the Elo algorithm (Equations 1 and 2) but does not include a distinct pseudocode block or algorithm section (a minimal sketch of the standard Elo update is given below the table). |
| Open Source Code | No | The paper's NeurIPS checklist states that 'We provide thorough details on generating synthetic human feedback data... These comprehensive instructions should enable other researchers to effectively replicate our experiments and verify our results.' However, it does not provide a direct link to a code repository or explicitly state that the source code for the methodology is released. |
| Open Datasets | Yes | We use the LMSYS Chatbot Arena dataset [34], an open-source collection of human preference data derived from unique user interactions with two distinct models responding to a set of user-defined prompts. |
| Dataset Splits | No | The paper describes generating synthetic data and sampling from real-world data (e.g., 'sampling a fixed number, N_sample, for each pair'), but it does not specify traditional training, validation, and test dataset splits as typically used for training machine learning models. |
| Hardware Specification | Yes | We generated completions in batches sized between 8 and 50, depending on the size of each model evaluated, using an Nvidia A100 GPU with 40GB memory for efficient computation. Inference was performed in a Bfloat16 setting to reduce memory usage, as detailed in referenced work. |
| Software Dependencies | No | The paper mentions 'Bfloat16 setting' for inference but does not specify any software names with version numbers (e.g., Python, PyTorch, CUDA, or specific libraries). |
| Experiment Setup | Yes | Experimental Setup: To quantify the effect of match-up ordering, we generate a baseline sequence of N_games = 1000 match outcomes... N_perms is varied from a minimum of 1 to a maximum of 10k... We extend our previous approach by conducting tests across a range of winning probabilities and multiple K-factor values (1, 8, 16, 32, 64). (An illustrative reconstruction of this ordering experiment follows the table.) |
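
As noted in the Pseudocode row, the paper expresses the Elo algorithm only as Equations 1 and 2 and releases no code. The sketch below is a minimal Python rendering of the standard Elo expected-score and rating-update rules; the function names, the 400-point logistic scale, and the default K-factor of 16 are conventional assumptions made here for illustration, not the authors' implementation.

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Expected score of player A against player B (standard Elo logistic curve)."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def elo_update(r_a: float, r_b: float, score_a: float, k: float = 16.0) -> tuple[float, float]:
    """Update both ratings after one match.

    score_a is 1.0 if A wins, 0.5 for a tie, 0.0 if A loses.
    """
    e_a = expected_score(r_a, r_b)
    e_b = 1.0 - e_a
    new_a = r_a + k * (score_a - e_a)
    new_b = r_b + k * ((1.0 - score_a) - e_b)
    return new_a, new_b
```

With K = 16 and both models starting at 1000, a single win moves the winner up by 8 points and the loser down by 8, which is why the choice of K-factor matters for the volatility the paper studies.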
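The Experiment Setup row describes recomputing Elo over random permutations of a fixed sequence of match outcomes across several K-factors. The following self-contained sketch is an assumption-based reconstruction of that ordering-sensitivity test, not the authors' code: it shuffles the same N_games = 1000 outcomes N_perms times and records the spread of model A's final rating.

```python
import random


def _elo_step(r_a: float, r_b: float, a_wins: float, k: float) -> tuple[float, float]:
    # Standard Elo update (same formulas as in the sketch above).
    e_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
    r_a += k * (a_wins - e_a)
    r_b += k * ((1.0 - a_wins) - (1.0 - e_a))
    return r_a, r_b


def final_ratings_over_orderings(outcomes, n_perms=1000, k=16.0, start=1000.0, seed=0):
    """Recompute model A's final Elo rating under n_perms random re-orderings
    of one fixed set of match outcomes (1 = A wins, 0 = B wins)."""
    rng = random.Random(seed)
    results = []
    for _ in range(n_perms):
        order = list(outcomes)
        rng.shuffle(order)
        r_a, r_b = start, start
        for a_wins in order:
            r_a, r_b = _elo_step(r_a, r_b, float(a_wins), k)
        results.append(r_a)
    return results


# Example: 1000 synthetic games where model A wins with probability 0.55.
rng = random.Random(1)
games = [1 if rng.random() < 0.55 else 0 for _ in range(1000)]
spread = final_ratings_over_orderings(games, n_perms=100)
print(min(spread), max(spread))  # the range of final ratings indicates ordering sensitivity
```

Varying k over the paper's reported values (1, 8, 16, 32, 64) and the underlying win probability reproduces, in spirit, the kind of volatility analysis the quoted setup describes.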