DINGO: Towards Diverse and Fine-Grained Instruction-Following Evaluation
Authors: Zihui Gu, Xingwu Sun, Fengzong Lian, Zhanhui Kang, Chengzhong Xu, Ju Fan
AAAI 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through extensive experiments, we demonstrate that DINGO can not only provide more challenging and comprehensive evaluation for LLMs, but also provide task-level fine-grained directions to further improve LLMs. |
| Researcher Affiliation | Collaboration | ¹Renmin University of China, ²Tencent Inc., ³University of Macau; {guzh, fanj}@ruc.edu.cn, sunxingwu01@gmail.com, {faxonlian, kegokang}@tencent.com, czxu@um.edu.mo |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | We release the DINGO dataset at GitHub: https://github.com/ruc-datalab/DINGO |
| Open Datasets | Yes | We release the DINGO dataset at GitHub: https://github.com/ruc-datalab/DINGO ... Table 3: The basic question source of DINGO. Word Problems: GSM8K (Cobbe et al. 2021a) |
| Dataset Splits | No | The paper uses the DINGO dataset solely as a test set for evaluating LLMs. It does not define training, validation, or test splits, either for DINGO itself or for the evaluated LLMs, since those models are pre-existing. |
| Hardware Specification | No | The paper does not specify any particular hardware details such as GPU models, CPU models, or memory used for running the experiments. |
| Software Dependencies | No | The paper mentions using 'GPT-4' as a judge, but it does not specify any software names with version numbers for reproducibility (e.g., Python version, specific library versions). |
| Experiment Setup | No | The paper describes the evaluation methodology (LLM-as-a-judge, scoring criteria) but does not provide specific hyperparameters (e.g., learning rates, batch sizes, epochs) or detailed system-level training settings for the experimental setup. |