FLASK: Fine-grained Language Model Evaluation based on Alignment Skill Sets
Authors: Seonghyeon Ye, Doyoung Kim, Sungdong Kim, Hyeonbin Hwang, Seungone Kim, Yongrae Jo, James Thorne, Juho Kim, Minjoon Seo
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We experimentally observe that the fine-graininess of evaluation is crucial for attaining a holistic view of model performance and increasing the reliability of the evaluation. Using FLASK, we compare multiple open-source and proprietary LLMs and observe a high correlation between model-based and human-based evaluations. |
| Researcher Affiliation | Academia | Seonghyeon Ye, Doyoung Kim, Sungdong Kim, Hyeonbin Hwang, Seungone Kim, Yongrae Jo, James Thorne, Juho Kim, Minjoon Seo (KAIST) |
| Pseudocode | No | The paper describes processes and uses figures to illustrate them (e.g., Figure 1, Figure 21), but it does not include formal pseudocode or algorithm blocks. |
| Open Source Code | Yes | We publicly release the evaluation data and code implementation at www.omitted.link. |
| Open Datasets | Yes | We first collect input (instruction) and output (reference answer) pairs from various English NLP datasets, both multitask datasets (e.g. MMLU (Hendrycks et al., 2020)) and single-task datasets (e.g. GSM8K (Cobbe et al., 2021)). |
| Dataset Splits | No | The paper defines evaluation sets (e.g., the whole FLASK evaluation set, 200 randomly sampled instances for human evaluation, FLASK-HARD subset) but does not provide standard training/validation/test splits of the FLASK dataset itself for the purpose of training a model. FLASK is primarily used as an evaluation benchmark for existing LLMs. |
| Hardware Specification | No | The paper evaluates various LLMs but does not specify the hardware configurations (e.g., GPU models, CPU types, memory) used to run its own evaluation experiments. |
| Software Dependencies | No | The paper mentions specific versions of LLMs evaluated (e.g., "gpt-4-0613 version", "CLAUDE 1.0") but does not provide specific version numbers for ancillary software dependencies (e.g., programming languages, libraries, frameworks) used to implement and run their evaluation framework. |
| Experiment Setup | Yes | For model-based evaluation, we enforce the EVAL LM to generate a rationale before assigning a score, inspired by the effectiveness of CoT prompting (Wei et al., 2022b) for the evaluation of LLMs (Liu et al., 2023). ... For the response generation of each target model, we set the temperature to 0.7 and set the max generation sequences as 1024. |
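
The Experiment Setup row quotes concrete settings: the EVAL LM must produce a rationale before a score, and target-model responses are generated with temperature 0.7 and a maximum of 1024 tokens. The sketch below shows one way to wire up those settings, assuming the OpenAI Python client; the model names, the rubric wording, and the `score_response` helper are illustrative assumptions, not the paper's released implementation.

```python
# Minimal sketch of the generation/evaluation settings quoted in the table above.
# Assumptions (not from the paper's released code): model names, prompt wording,
# and the score-parsing regex are illustrative placeholders.
import re
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def generate_response(instruction: str, target_model: str = "gpt-3.5-turbo") -> str:
    """Generate a target model's answer with the decoding settings quoted above."""
    completion = client.chat.completions.create(
        model=target_model,
        messages=[{"role": "user", "content": instruction}],
        temperature=0.7,   # "we set the temperature to 0.7"
        max_tokens=1024,   # "set the max generation sequences as 1024"
    )
    return completion.choices[0].message.content

def score_response(instruction: str, response: str, skill: str,
                   eval_model: str = "gpt-4-0613") -> tuple[str, int | None]:
    """Ask the EVAL LM for a rationale first, then a 1-5 score for one skill."""
    prompt = (
        f"Instruction:\n{instruction}\n\nResponse:\n{response}\n\n"
        f"Evaluate the response on the skill '{skill}'. "
        "First write a short rationale, then output a line of the form 'Score: <1-5>'."
    )
    completion = client.chat.completions.create(
        model=eval_model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,  # deterministic scoring; an assumption, not stated in the paper
    )
    text = completion.choices[0].message.content
    match = re.search(r"Score:\s*([1-5])", text)
    return text, int(match.group(1)) if match else None
```

Keeping the rationale in the prompt before the score mirrors the paper's stated rationale-then-score ordering; the exact evaluation template and skill rubrics should be taken from the authors' released evaluation data and code rather than this sketch.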