Prometheus: Inducing Fine-Grained Evaluation Capability in Language Models
Authors: Seungone Kim, Jamin Shin, Yejin Cho, Joel Jang, Shayne Longpre, Hwaran Lee, Sangdoo Yun, Seongjin Shin, Sungdong Kim, James Thorne, Minjoon Seo
ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results show that PROMETHEUS scores a Pearson correlation of 0.897 with human evaluators when evaluating with 45 customized score rubrics, which is on par with GPT-4 (0.882) and greatly outperforms ChatGPT (0.392). (A correlation sketch follows the table.) |
| Researcher Affiliation | Collaboration | Seungone Kim (KAIST AI, NAVER AI Lab), Jamin Shin (NAVER AI Lab, NAVER Cloud), Yejin Cho (KAIST AI), Joel Jang (University of Washington), Shayne Longpre (MIT), Hwaran Lee (NAVER AI Lab, NAVER Cloud), Sangdoo Yun (NAVER AI Lab, NAVER Cloud), Seongjin Shin (NAVER Cloud), Sungdong Kim (KAIST AI, NAVER AI Lab, NAVER Cloud), James Thorne (KAIST AI), Minjoon Seo (KAIST AI) |
| Pseudocode | No | The paper describes processes like dataset construction but does not present any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | We will open-source our code, dataset, and model (https://kaistai.github.io/prometheus/). |
| Open Datasets | Yes | We first construct the FEEDBACK COLLECTION, a new dataset that consists of 1K fine-grained score rubrics, 20K instructions, and 100K responses and language feedback generated by GPT-4. Using the FEEDBACK COLLECTION, we train PROMETHEUS, a 13B evaluator LLM that can assess any given long-form text based on a customized score rubric provided by the user. We will open-source our code, dataset, and model (https://kaistai.github.io/prometheus/). (A record-layout sketch follows the table.) |
| Dataset Splits | No | The paper describes using the FEEDBACK COLLECTION for training and FEEDBACK BENCH for evaluation, but it does not specify explicit training/validation/test dataset splits (e.g., percentages or counts) for reproduction. |
| Hardware Specification | Yes | We use 8x A100 (80GB) GPUs to train our models with PyTorch Fully-Sharded Data Parallel (FSDP) option. |
| Software Dependencies | No | The paper mentions 'PyTorch' and refers to the 'official Llama2 fine-tuning code' but does not specify explicit version numbers for these software dependencies. |
| Experiment Setup | Yes | The hyper-parameters we used are the basic settings in the fine-tuning code, except for the training batch size, which was set according to the model size: for 7B models we used 28 and for 13B models we used 20 to fully leverage GPU memory. The detailed hyper-parameters are shown in Table 8. For inference, we use the hyper-parameters as shown in Table 9. (A minimal FSDP training sketch follows the table.) |
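
The headline result in the Research Type row is a Pearson correlation between the evaluator LLM's scores and human scores on the same responses. The sketch below shows how such a correlation is computed; the score arrays are hypothetical placeholders, not the paper's data.

```python
# Minimal sketch: Pearson correlation between an evaluator LLM's scores and
# human scores on the same set of responses. The arrays are hypothetical
# placeholders, not data from the paper.
from scipy.stats import pearsonr

human_scores = [5, 3, 4, 2, 5, 1, 4, 3]      # human-assigned 1-5 rubric scores
evaluator_scores = [5, 3, 5, 2, 4, 1, 4, 2]  # scores assigned by the evaluator LLM

r, p_value = pearsonr(human_scores, evaluator_scores)
print(f"Pearson r = {r:.3f} (p = {p_value:.3g})")
```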
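
The Feedback Collection described in the Open Datasets row pairs instructions and responses with fine-grained score rubrics, GPT-4 language feedback, and scores. The record layout below is a hypothetical sketch to make those fields concrete; the field names and the reference-answer field are assumptions, not the released schema.

```python
# Hypothetical sketch of one Feedback Collection-style training record.
# Field names are assumptions for illustration, not the released schema.
from dataclasses import dataclass

@dataclass
class FeedbackRecord:
    instruction: str       # user instruction the response answers
    response: str          # long-form response to be evaluated
    score_rubric: str      # customized, fine-grained scoring criteria
    reference_answer: str  # reference answer to compare against (assumed field)
    feedback: str          # GPT-4-generated language feedback
    score: int             # integer score on a 1-5 scale

example = FeedbackRecord(
    instruction="Explain overfitting to a beginner.",
    response="Overfitting is when a model memorizes training data...",
    score_rubric="Does the explanation stay accurate while remaining accessible?",
    reference_answer="Overfitting occurs when a model fits noise...",
    feedback="The response is accurate but omits a concrete example...",
    score=4,
)
```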
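
The Hardware Specification and Experiment Setup rows describe fine-tuning with PyTorch Fully-Sharded Data Parallel (FSDP) on 8x A100 (80GB) GPUs, with training batch sizes of 28 (7B) and 20 (13B). The sketch below shows a minimal FSDP training loop under those assumptions; the model, optimizer, and learning rate are illustrative stand-ins, not the paper's official Llama2 fine-tuning code.

```python
# Minimal PyTorch FSDP sketch, launched with e.g.:
#   torchrun --nproc_per_node=8 fsdp_sketch.py
# The model and hyper-parameters are illustrative stand-ins.
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

# Placeholder model; the paper fine-tunes Llama-2 7B/13B instead.
model = torch.nn.TransformerEncoderLayer(d_model=1024, nhead=16, batch_first=True).cuda()
model = FSDP(model)  # shard parameters, gradients, and optimizer state across GPUs

batch_size = 20  # paper: 28 for 7B models, 20 for 13B models
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)  # lr is an assumption

for step in range(10):  # illustrative loop over random data
    inputs = torch.randn(batch_size, 32, 1024, device="cuda")
    loss = model(inputs).mean()
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

dist.destroy_process_group()
```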