IDGen: Item Discrimination Induced Prompt Generation for LLM Evaluation
Authors: Fan Lin, Shuyi Xie, Yong Dai, Wenlin Yao, Tianjiao Lang, Yu Zhang
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We apply our generated data to evaluate five SOTA models. Our data achieves an average score of 51.92, accompanied by a variance of 10.06. By contrast, previous works (i.e., SELF-INSTRUCT and Wizard LM) obtain an average score exceeding 67, with a variance below 3.2. The results demonstrate that the data generated by our framework is more challenging and discriminative compared to previous works. We will release a dataset of over 3,000 carefully crafted prompts to facilitate evaluation research of LLMs. |
| Researcher Affiliation | Collaboration | Fan Lin^{1,2}, Shuyi Xie^2, Yong Dai^2, Wenlin Yao^2, Tianjiao Lang^2, Yu Zhang^1. ^1Southeast University, Nanjing, China; ^2Tencent, Shenzhen, China |
| Pseudocode | No | The paper describes the framework and methods textually and with a diagram (Figure 1), but does not contain a formal pseudocode or algorithm block. |
| Open Source Code | Yes | Code and data are available at https://github.com/DUTlf/IDGen.git |
| Open Datasets | Yes | The English instances include 175 sourced from the SELF-INSTRUCT dataset [16] and the remainder from the Alpaca dataset [18]. |
| Dataset Splits | No | The paper does not explicitly state training/validation/test dataset splits with percentages or sample counts for its experiments. It refers to human-evaluated questions for model validation, but not a dataset split. |
| Hardware Specification | No | The paper does not explicitly describe the hardware used for its experiments. |
| Software Dependencies | No | The paper mentions various LLM models (e.g., Hunyuan, GPT-4, Qwen, Baichuan2-13B) used, but does not provide specific version numbers for software dependencies or libraries required to replicate the experiments. |
| Experiment Setup | Yes | In this section, we first introduce the experimental setup, including the baselines and the seed data. Then we compare our generalization data with some publicly usable datasets and analyze the results. Subsequently, we assess the usability of our data, as well as the discrimination indexes and difficulty score, and provide relevant analysis. Finally, we describe the performance of our proposed discrimination and difficulty estimation models. |
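
The discrimination claim quoted in the Research Type row reduces to two statistics over per-model scores: the mean score (a proxy for difficulty) and the variance across the evaluated models (a proxy for discrimination). The sketch below is not from the paper; the model names and score values are hypothetical placeholders, and the choice of population variance is an assumption about how the reported variance was computed.

```python
import statistics

# Hypothetical per-model scores on a generated prompt set
# (illustrative placeholders, not figures reported in the paper).
scores = {
    "model_a": 48.1,
    "model_b": 55.3,
    "model_c": 50.0,
    "model_d": 57.9,
    "model_e": 48.3,
}

values = list(scores.values())
mean_score = statistics.mean(values)          # lower mean -> harder prompt set
score_variance = statistics.pvariance(values) # higher variance -> more discriminative

# A lower mean combined with a higher variance is the pattern the abstract
# contrasts against SELF-INSTRUCT and WizardLM data (mean > 67, variance < 3.2).
print(f"average score: {mean_score:.2f}, variance: {score_variance:.2f}")
```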