CriticEval: Evaluating Large-scale Language Model as Critic

Authors: Tian Lan, Wenwei Zhang, Chen Xu, Heyan Huang, Dahua Lin, Kai Chen, Xian-Ling Mao

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive evaluations of open-source and closed-source LLMs first validate the reliability of evaluation in CRITICEVAL. Then, experimental results demonstrate the promising potential of open-source LLMs, the effectiveness of critique datasets and several intriguing relationships between the critique ability and some critical factors, including task types, response qualities and critique dimensions.
Researcher Affiliation | Collaboration | Tian Lan (1), Wenwei Zhang (2), Chen Xu (4), Heyan Huang (1), Dahua Lin (2,3,5), Kai Chen (2), Xian-Ling Mao (1); (1) School of Computer Science and Technology, Beijing Institute of Technology; (2) Shanghai AI Laboratory; (3) MMLab, The Chinese University of Hong Kong; (4) Key Laboratory of Brain Health Intelligent Evaluation and Intervention, Ministry of Education, Beijing Institute of Technology; (5) CPII under InnoHK
Pseudocode | No | The paper describes a "human-in-the-loop data construction pipeline as shown in Figure 2" with steps, but it does not present this as formal pseudocode or an algorithm block.
Open Source Code | Yes | https://github.com/open-compass/CriticEval
Open Datasets | Yes | Task inputs for 9 distinct tasks are collected to evaluate critique capabilities comprehensively (Step 1 in Figure 2). Specifically, CRITICEVAL includes three widely used tasks for evaluating critique ability: (1) representative classical language tasks: summary [39], translation [40], and question-answering [41]; (2) LLM alignment: general chat scenarios [19] and harmlessness cases [35]; (3) reasoning and code capabilities: math reasoning with chain-of-thought (CoT) and program-of-thought (PoT), and coding with and without execution results. We hereinafter refer to code w/ execution as Code Exec and code w/o execution as Code NE. For each task, we collect around 100 task inputs from the test sets of some widely used benchmark datasets to ensure the task input quality and avoid data contamination. Please refer to Appendix D for more details about the data source. (See the sampling sketch after the table.)
Dataset Splits | Yes | The statistics of CRITICEVAL in the test and dev set are shown in Table 15.
Hardware Specification | Yes | The inference procedures of all these evaluated LLMs in this paper are conducted on an A800 server with 8 GPU cards, each with 80G CUDA memory.
Software Dependencies | Yes | The vLLM [82] and LMDeploy [83] packages are used to speed up the inference. (See the inference sketch after the table.)
Experiment Setup | Yes | The prompt templates for LLMs on critique dimensions are shown in Appendix I, with score rubrics listed in Figure 18 in Appendix H.3. (See the prompt-builder sketch after the table.)
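
The Open Datasets row notes that roughly 100 task inputs per task are drawn from the test sets of existing benchmarks. A minimal sketch of that sampling step is below; the file paths, task names, and JSONL layout are hypothetical placeholders (the real data sources are listed in Appendix D of the paper).

```python
# Minimal sketch (assumptions: JSONL test splits, hypothetical paths and task names).
import json
import random

random.seed(0)  # fixed seed so the ~100-example sample is reproducible

def sample_task_inputs(test_set_path: str, n: int = 100) -> list[dict]:
    """Draw about n task inputs from a benchmark *test* split to limit contamination."""
    with open(test_set_path, encoding="utf-8") as f:
        examples = [json.loads(line) for line in f]
    return random.sample(examples, min(n, len(examples)))

# Hypothetical task -> test-split mapping; CriticEval covers 9 tasks in total.
task_files = {
    "summary": "data/summary_test.jsonl",
    "translation": "data/translation_test.jsonl",
    "question_answering": "data/qa_test.jsonl",
}
task_inputs = {task: sample_task_inputs(path) for task, path in task_files.items()}
```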
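
The Hardware Specification and Software Dependencies rows report that inference ran on an 8-GPU A800 server accelerated with vLLM and LMDeploy. Below is a hedged sketch of such a setup using vLLM's offline batched API; the model name, prompts, and sampling values are illustrative assumptions, not the paper's exact configuration.

```python
# Sketch only: vLLM offline batched generation sharded across 8 GPUs.
from vllm import LLM, SamplingParams

# Placeholder critique prompts; in CriticEval these would come from the benchmark.
prompts = [
    "You are a critic. Evaluate the following response to the task ...",
]

# tensor_parallel_size=8 splits the model across the server's 8 A800 cards.
llm = LLM(model="meta-llama/Llama-2-70b-chat-hf", tensor_parallel_size=8)
sampling = SamplingParams(temperature=0.0, max_tokens=1024)

for output in llm.generate(prompts, sampling):
    print(output.outputs[0].text)
```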
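
The Experiment Setup row points to prompt templates and score rubrics in Appendix I and Figure 18. The sketch below shows one plausible way to assemble a dimension-specific critique prompt; the template wording, rubric text, and the 1-7 scale are assumptions for illustration, not the paper's actual templates.

```python
# Illustrative critique-prompt builder (template text and scoring scale are assumed).
CRITIQUE_TEMPLATE = """You are a critic evaluating a model's response.

Task input:
{task_input}

Model response:
{response}

Score rubric for the "{dimension}" dimension:
{rubric}

Write a critique and end with a line of the form "Score: <1-7>"."""

def build_critique_prompt(task_input: str, response: str, dimension: str, rubric: str) -> str:
    return CRITIQUE_TEMPLATE.format(
        task_input=task_input, response=response, dimension=dimension, rubric=rubric
    )

print(build_critique_prompt(
    task_input="Summarize the article below ...",
    response="The article argues that ...",
    dimension="feedback",
    rubric="7 = accurate, specific, actionable critique; 1 = irrelevant or incorrect.",
))
```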