DINGO: Towards Diverse and Fine-Grained Instruction-Following Evaluation

Authors: Zihui Gu, Xingwu Sun, Fengzong Lian, Zhanhui Kang, Chengzhong Xu, Ju Fan

AAAI 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Through extensive experiments, we demonstrate that DINGO can not only provide more challenging and comprehensive evaluation for LLMs, but also provide task-level fine-grained directions to further improve LLMs."
Researcher Affiliation | Collaboration | 1 Renmin University of China, 2 Tencent Inc., 3 University of Macau. {guzh, fanj}@ruc.edu.cn, sunxingwu01@gmail.com, {faxonlian, kegokang}@tencent.com, czxu@um.edu.mo
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | Yes | "We release the DINGO dataset at Github1. 1https://github.com/ruc-datalab/DINGO"
Open Datasets | Yes | "We release the DINGO dataset at Github1. 1https://github.com/ruc-datalab/DINGO" ... Table 3: The basic question source of DINGO. Word Problems: GSM8K (Cobbe et al. 2021a)
Dataset Splits | No | DINGO itself serves as the test set for the paper's experiments. The paper does not define training, validation, or test splits, either for DINGO or for the evaluated LLMs, since those models are pre-existing.
Hardware Specification | No | The paper does not specify hardware details such as GPU models, CPU models, or memory used to run the experiments.
Software Dependencies | No | The paper mentions using GPT-4 as a judge, but it does not specify any software with version numbers needed for reproducibility (e.g., Python version, specific library versions).
Experiment Setup | No | The paper describes the evaluation methodology (LLM-as-a-judge with scoring criteria) but does not provide hyperparameters (e.g., learning rates, batch sizes, epochs) or detailed system-level training settings.
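To make the LLM-as-a-judge methodology mentioned above concrete, here is a minimal sketch of the two mechanical steps such an evaluation involves: building a judge prompt for an instruction/response pair and parsing a numeric score out of the judge's reply. The prompt template, the 1-to-10 scale, and the `build_judge_prompt`/`parse_score` helpers are illustrative assumptions, not DINGO's actual prompts or code (which the paper does not reproduce).

```python
import re
from typing import Optional

# Hypothetical judge prompt template; the wording and 1-10 scale are
# assumptions for illustration, not taken from the DINGO paper.
JUDGE_TEMPLATE = (
    "You are an impartial judge. Rate the assistant's response to the\n"
    "instruction below on a scale of 1 to 10, then briefly justify.\n\n"
    "Instruction: {instruction}\n"
    "Response: {response}\n\n"
    "Reply in the form 'Score: <n>'."
)

def build_judge_prompt(instruction: str, response: str) -> str:
    """Fill the judge template with one instruction/response pair."""
    return JUDGE_TEMPLATE.format(instruction=instruction, response=response)

def parse_score(judge_reply: str) -> Optional[int]:
    """Extract an integer score from a judge reply such as 'Score: 8'.

    Returns None when no score is present or it falls outside 1-10,
    so malformed judge outputs can be filtered or retried.
    """
    match = re.search(r"Score:\s*(\d+)", judge_reply)
    if match is None:
        return None
    score = int(match.group(1))
    return score if 1 <= score <= 10 else None
```

In practice the prompt returned by `build_judge_prompt` would be sent to the judge model (GPT-4 in the paper), and `parse_score` applied to its reply; reporting hinges on this parsing step being robust to free-form judge text.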