DINGO: Towards Diverse and Fine-Grained Instruction-Following Evaluation
Authors: Zihui Gu, Xingwu Sun, Fengzong Lian, Zhanhui Kang, Chengzhong Xu, Ju Fan
AAAI 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through extensive experiments, we demonstrate that DINGO can not only provide more challenging and comprehensive evaluation for LLMs, but also provide task-level fine-grained directions to further improve LLMs. |
| Researcher Affiliation | Collaboration | ¹Renmin University of China, ²Tencent Inc., ³University of Macau; {guzh, fanj}@ruc.edu.cn, sunxingwu01@gmail.com, {faxonlian, kegokang}@tencent.com, czxu@um.edu.mo |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | We release the DINGO dataset at GitHub: https://github.com/ruc-datalab/DINGO |
| Open Datasets | Yes | We release the DINGO dataset at GitHub: https://github.com/ruc-datalab/DINGO ... Table 3: The basic question source of DINGO. Word Problems: GSM8K (Cobbe et al. 2021a) |
| Dataset Splits | No | The paper uses the DINGO dataset solely as a test set for evaluating LLMs. It does not define training, validation, or test splits, either for DINGO itself or for the evaluated LLMs, since those models are pre-existing. |
| Hardware Specification | No | The paper does not specify any particular hardware details such as GPU models, CPU models, or memory used for running the experiments. |
| Software Dependencies | No | The paper mentions using 'GPT-4' as a judge, but it does not specify any software names with version numbers for reproducibility (e.g., Python version, specific library versions). |
| Experiment Setup | No | The paper describes the evaluation methodology (LLM-as-a-judge, scoring criteria) but does not provide specific hyperparameters (e.g., learning rates, batch sizes, epochs) or detailed system-level training settings for the experimental setup. |