Prometheus: Inducing Fine-Grained Evaluation Capability in Language Models
Authors: Seungone Kim, Jamin Shin, Yejin Cho, Joel Jang, Shayne Longpre, Hwaran Lee, Sangdoo Yun, Seongjin Shin, Sungdong Kim, James Thorne, Minjoon Seo
ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results show that PROMETHEUS scores a Pearson correlation of 0.897 with human evaluators when evaluating with 45 customized score rubrics, which is on par with GPT-4 (0.882) and greatly outperforms ChatGPT (0.392). (A correlation sketch follows the table.) |
| Researcher Affiliation | Collaboration | Seungone Kim (KAIST AI, NAVER AI Lab), Jamin Shin (NAVER AI Lab, NAVER Cloud), Yejin Cho (KAIST AI), Joel Jang (University of Washington), Shayne Longpre (MIT), Hwaran Lee (NAVER AI Lab, NAVER Cloud), Sangdoo Yun (NAVER AI Lab, NAVER Cloud), Seongjin Shin (NAVER Cloud), Sungdong Kim (KAIST AI, NAVER AI Lab, NAVER Cloud), James Thorne (KAIST AI), Minjoon Seo (KAIST AI) |
| Pseudocode | No | The paper describes processes like dataset construction but does not present any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | We will open-source our code, dataset, and model (https://kaistai.github.io/prometheus/). |
| Open Datasets | Yes | We first construct the FEEDBACK COLLECTION, a new dataset that consists of 1K fine-grained score rubrics, 20K instructions, and 100K responses and language feedback generated by GPT-4. Using the FEEDBACK COLLECTION, we train PROMETHEUS, a 13B evaluator LLM that can assess any given long-form text based on a customized score rubric provided by the user. We will open-source our code, dataset, and model (https://kaistai.github.io/prometheus/). (A record-layout sketch follows the table.) |
| Dataset Splits | No | The paper describes using the FEEDBACK COLLECTION for training and FEEDBACK BENCH for evaluation, but it does not specify explicit training/validation/test dataset splits (e.g., percentages or counts) for reproduction. |
| Hardware Specification | Yes | We use 8x A100 (80GB) GPUs to train our models with PyTorch Fully-Sharded Data Parallel (FSDP) option. |
| Software Dependencies | No | The paper mentions 'PyTorch' and refers to the 'official Llama2 fine-tuning code' but does not specify explicit version numbers for these software dependencies. |
| Experiment Setup | Yes | The hyper-parameters we used are the basic settings in the fine-tuning code, except for the training batch size, which was set according to the model size: for 7B models we used 28 and for 13B models we used 20 to fully leverage GPU memory. The detailed hyper-parameters are shown in Table 8. For inference, we use the hyper-parameters as shown in Table 9. (A minimal FSDP training sketch follows the table.) |
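
The headline result in the Research Type row is a Pearson correlation between the evaluator LLM's scores and human scores on the same responses. The sketch below shows how such a correlation is computed; the score arrays are hypothetical placeholders, not the paper's data.

```python
# Minimal sketch: Pearson correlation between an evaluator LLM's scores and
# human scores on the same set of responses. The arrays are hypothetical
# placeholders, not data from the paper.
from scipy.stats import pearsonr

human_scores = [5, 3, 4, 2, 5, 1, 4, 3]      # human-assigned 1-5 rubric scores
evaluator_scores = [5, 3, 5, 2, 4, 1, 4, 2]  # scores assigned by the evaluator LLM

r, p_value = pearsonr(human_scores, evaluator_scores)
print(f"Pearson r = {r:.3f} (p = {p_value:.3g})")
```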
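
The Feedback Collection described in the Open Datasets row pairs instructions and responses with fine-grained score rubrics, GPT-4 language feedback, and scores. The record layout below is a hypothetical sketch to make those fields concrete; the field names and the reference-answer field are assumptions, not the released schema.

```python
# Hypothetical sketch of one Feedback Collection-style training record.
# Field names are assumptions for illustration, not the released schema.
from dataclasses import dataclass

@dataclass
class FeedbackRecord:
    instruction: str       # user instruction the response answers
    response: str          # long-form response to be evaluated
    score_rubric: str      # customized, fine-grained scoring criteria
    reference_answer: str  # reference answer to compare against (assumed field)
    feedback: str          # GPT-4-generated language feedback
    score: int             # integer score on a 1-5 scale

example = FeedbackRecord(
    instruction="Explain overfitting to a beginner.",
    response="Overfitting is when a model memorizes training data...",
    score_rubric="Does the explanation stay accurate while remaining accessible?",
    reference_answer="Overfitting occurs when a model fits noise...",
    feedback="The response is accurate but omits a concrete example...",
    score=4,
)
```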
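
The Hardware Specification and Experiment Setup rows describe fine-tuning with PyTorch Fully-Sharded Data Parallel (FSDP) on 8x A100 (80GB) GPUs, with training batch sizes of 28 (7B) and 20 (13B). The sketch below shows a minimal FSDP training loop under those assumptions; the model, optimizer, and learning rate are illustrative stand-ins, not the paper's official Llama2 fine-tuning code.

```python
# Minimal PyTorch FSDP sketch, launched with e.g.:
#   torchrun --nproc_per_node=8 fsdp_sketch.py
# The model and hyper-parameters are illustrative stand-ins.
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

# Placeholder model; the paper fine-tunes Llama-2 7B/13B instead.
model = torch.nn.TransformerEncoderLayer(d_model=1024, nhead=16, batch_first=True).cuda()
model = FSDP(model)  # shard parameters, gradients, and optimizer state across GPUs

batch_size = 20  # paper: 28 for 7B models, 20 for 13B models
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)  # lr is an assumption

for step in range(10):  # illustrative loop over random data
    inputs = torch.randn(batch_size, 32, 1024, device="cuda")
    loss = model(inputs).mean()
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

dist.destroy_process_group()
```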