IDGen: Item Discrimination Induced Prompt Generation for LLM Evaluation
Authors: Fan Lin, Shuyi Xie, Yong Dai, Wenlin Yao, Tianjiao Lang, Yu Zhang
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We apply our generated data to evaluate five SOTA models. Our data achieves an average score of 51.92, accompanied by a variance of 10.06. By contrast, previous works (i.e., SELF-INSTRUCT and Wizard LM) obtain an average score exceeding 67, with a variance below 3.2. The results demonstrate that the data generated by our framework is more challenging and discriminative compared to previous works. We will release a dataset of over 3,000 carefully crafted prompts to facilitate evaluation research of LLMs. |
| Researcher Affiliation | Collaboration | Fan Lin^{1,2}, Shuyi Xie^2, Yong Dai^2, Wenlin Yao^2, Tianjiao Lang^2, Yu Zhang^1. ^1Southeast University, Nanjing, China; ^2Tencent, Shenzhen, China |
| Pseudocode | No | The paper describes the framework and methods textually and with a diagram (Figure 1), but does not contain a formal pseudocode or algorithm block. |
| Open Source Code | Yes | Code and data are available at https://github.com/DUTlf/IDGen.git |
| Open Datasets | Yes | The English instances include 175 sourced from the SELF-INSTRUCT dataset [16] and the remainder from the Alpaca dataset [18]. |
| Dataset Splits | No | The paper does not explicitly state training/validation/test dataset splits with percentages or sample counts for its experiments. It refers to human-evaluated questions for model validation, but not a dataset split. |
| Hardware Specification | No | The paper does not explicitly describe the hardware used for its experiments. |
| Software Dependencies | No | The paper mentions various LLM models (e.g., Hunyuan, GPT-4, Qwen, Baichuan2-13B) used, but does not provide specific version numbers for software dependencies or libraries required to replicate the experiments. |
| Experiment Setup | Yes | In this section, we first introduce the experimental setup, including the baselines and the seed data. Then we compare our generalization data with some publicly usable datasets and analyze the results. Subsequently, we assess the usability of our data, as well as the discrimination indexes and difficulty score, and provide relevant analysis. Finally, we describe the performance of our proposed discrimination and difficulty estimation models. |
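
The discrimination claim quoted in the Research Type row reduces to two statistics over per-model scores: the mean score (a proxy for difficulty) and the variance across the evaluated models (a proxy for discrimination). The sketch below is not from the paper; the model names and score values are hypothetical placeholders, and the choice of population variance is an assumption about how the reported variance was computed.

```python
import statistics

# Hypothetical per-model scores on a generated prompt set
# (illustrative placeholders, not figures reported in the paper).
scores = {
    "model_a": 48.1,
    "model_b": 55.3,
    "model_c": 50.0,
    "model_d": 57.9,
    "model_e": 48.3,
}

values = list(scores.values())
mean_score = statistics.mean(values)          # lower mean -> harder prompt set
score_variance = statistics.pvariance(values) # higher variance -> more discriminative

# A lower mean combined with a higher variance is the pattern the abstract
# contrasts against SELF-INSTRUCT and WizardLM data (mean > 67, variance < 3.2).
print(f"average score: {mean_score:.2f}, variance: {score_variance:.2f}")
```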