FLASK: Fine-grained Language Model Evaluation based on Alignment Skill Sets
Authors: Seonghyeon Ye, Doyoung Kim, Sungdong Kim, Hyeonbin Hwang, Seungone Kim, Yongrae Jo, James Thorne, Juho Kim, Minjoon Seo
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We experimentally observe that the fine-graininess of evaluation is crucial for attaining a holistic view of model performance and increasing the reliability of the evaluation. Using FLASK, we compare multiple open-source and proprietary LLMs and observe a high correlation between model-based and human-based evaluations. |
| Researcher Affiliation | Academia | Seonghyeon Ye, Doyoung Kim, Sungdong Kim, Hyeonbin Hwang, Seungone Kim, Yongrae Jo, James Thorne, Juho Kim, Minjoon Seo (KAIST) |
| Pseudocode | No | The paper describes processes and uses figures to illustrate them (e.g., Figure 1, Figure 21), but it does not include formal pseudocode or algorithm blocks. |
| Open Source Code | Yes | We publicly release the evaluation data and code implementation at www.omitted.link. |
| Open Datasets | Yes | We first collect input (instruction) and output (reference answer) pairs from various English NLP datasets, both multitask datasets (e.g. MMLU (Hendrycks et al., 2020)) and single-task datasets (e.g. GSM8K (Cobbe et al., 2021)). |
| Dataset Splits | No | The paper defines evaluation sets (e.g., the whole FLASK evaluation set, 200 randomly sampled instances for human evaluation, FLASK-HARD subset) but does not provide standard training/validation/test splits of the FLASK dataset itself for the purpose of training a model. FLASK is primarily used as an evaluation benchmark for existing LLMs. |
| Hardware Specification | No | The paper evaluates various LLMs but does not specify the hardware configurations (e.g., GPU models, CPU types, memory) used to run its own evaluation experiments. |
| Software Dependencies | No | The paper mentions specific versions of LLMs evaluated (e.g., "gpt-4-0613 version", "CLAUDE 1.0") but does not provide specific version numbers for ancillary software dependencies (e.g., programming languages, libraries, frameworks) used to implement and run their evaluation framework. |
| Experiment Setup | Yes | For model-based evaluation, we enforce the EVAL LM to generate a rationale before assigning a score, inspired by the effectiveness of CoT prompting (Wei et al., 2022b) for the evaluation of LLMs (Liu et al., 2023). ... For the response generation of each target model, we set the temperature to 0.7 and set the max generation sequences as 1024. |
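
The Experiment Setup row quotes concrete settings: the EVAL LM must produce a rationale before a score, and target-model responses are generated with temperature 0.7 and a maximum of 1024 tokens. The sketch below shows one way to wire up those settings, assuming the OpenAI Python client; the model names, the rubric wording, and the `score_response` helper are illustrative assumptions, not the paper's released implementation.

```python
# Minimal sketch of the generation/evaluation settings quoted in the table above.
# Assumptions (not from the paper's released code): model names, prompt wording,
# and the score-parsing regex are illustrative placeholders.
import re
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def generate_response(instruction: str, target_model: str = "gpt-3.5-turbo") -> str:
    """Generate a target model's answer with the decoding settings quoted above."""
    completion = client.chat.completions.create(
        model=target_model,
        messages=[{"role": "user", "content": instruction}],
        temperature=0.7,   # "we set the temperature to 0.7"
        max_tokens=1024,   # "set the max generation sequences as 1024"
    )
    return completion.choices[0].message.content

def score_response(instruction: str, response: str, skill: str,
                   eval_model: str = "gpt-4-0613") -> tuple[str, int | None]:
    """Ask the EVAL LM for a rationale first, then a 1-5 score for one skill."""
    prompt = (
        f"Instruction:\n{instruction}\n\nResponse:\n{response}\n\n"
        f"Evaluate the response on the skill '{skill}'. "
        "First write a short rationale, then output a line of the form 'Score: <1-5>'."
    )
    completion = client.chat.completions.create(
        model=eval_model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,  # deterministic scoring; an assumption, not stated in the paper
    )
    text = completion.choices[0].message.content
    match = re.search(r"Score:\s*([1-5])", text)
    return text, int(match.group(1)) if match else None
```

Keeping the rationale in the prompt before the score mirrors the paper's stated rationale-then-score ordering; the exact evaluation template and skill rubrics should be taken from the authors' released evaluation data and code rather than this sketch.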