Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
PromptBench: A Unified Library for Evaluation of Large Language Models
Authors: Kaijie Zhu, Qinlin Zhao, Hao Chen, Jindong Wang, Xing Xie
JMLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The evaluation of large language models (LLMs) is crucial to assess their performance and mitigate potential security risks. In this paper, we introduce PromptBench, a unified library to evaluate LLMs. It consists of several key components that can be easily used and extended by researchers: prompt construction, prompt engineering, dataset and model loading, adversarial prompt attack, dynamic evaluation protocols, and analysis tools. PromptBench is designed as an open, general, and flexible codebase for research purposes. It aims to facilitate original study in creating new benchmarks, deploying downstream applications, and designing new evaluation protocols. The code is available at: https://github.com/microsoft/promptbench and will be continuously supported. Keywords: Evaluation, large language models, framework |
| Researcher Affiliation | Collaboration | Kaijie Zhu (1,2), Qinlin Zhao (1,3), Hao Chen (4), Jindong Wang (1), Xing Xie (1). Affiliations: (1) Microsoft Research Asia; (2) Institute of Automation, Chinese Academy of Sciences; (3) University of Science and Technology of China; (4) Carnegie Mellon University. Editor: Zeyi Wen. Corresponding author: Jindong Wang (EMAIL). |
| Pseudocode | No | The paper provides actual Python code snippets in Figure 2 under the section '2.2 Evaluation pipeline' rather than structured pseudocode or an algorithm block. |
| Open Source Code | Yes | The code is available at: https://github.com/microsoft/promptbench and will be continuously supported. |
| Open Datasets | Yes | GLUE (Wang et al., 2019): The GLUE benchmark (General Language Understanding Evaluation) offers a suite of tasks to evaluate the capability of NLP models in understanding language. For this research, we employed 8 specific tasks: Sentiment Analysis (SST-2 (Socher et al., 2013)), Grammar Correctness (CoLA (Warstadt et al., 2018)), Identifying Duplicate Sentences (QQP (Wang et al., 2017), MRPC (Dolan and Brockett, 2005)), and various Natural Language Inference tasks (MNLI (Williams et al., 2018), QNLI (Wang et al., 2019), RTE (Wang et al., 2019), WNLI (Levesque et al., 2012)). |
| Dataset Splits | Yes | GSM8K (Cobbe et al., 2021): The GSM8K dataset is a collection of 8.5K high-quality, linguistically diverse grade school math word problems. It was created by human problem writers and is divided into 7.5K training problems and 1K test problems. ... QASC (Khot et al., 2020): ...divided into 8,134 for training, 926 for development, and 920 for testing. |
| Hardware Specification | No | The paper does not explicitly describe the specific hardware (e.g., GPU models, CPU types) used for running its experiments. |
| Software Dependencies | No | The paper mentions 'pip install promptbench' and 'most LLMs implemented in Huggingface' as general software components, but does not provide specific version numbers for Python, Huggingface, or any other key libraries. |
| Experiment Setup | No | The paper describes the Prompt Bench library and its capabilities, and presents benchmark results, but it does not provide specific experimental setup details such as hyperparameters (e.g., learning rate, batch size, number of epochs) used to obtain these results. |
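
The components listed in the abstract (dataset loading, prompt construction, model inference, and analysis) form a standard evaluation loop. The sketch below illustrates that loop with a toy sentiment task; every name in it is hypothetical and it deliberately uses a mock model rather than PromptBench's actual API, which is documented in the repository linked above.

```python
# Illustrative mock of an LLM evaluation pipeline with the components the
# paper describes. All names are hypothetical; this is not PromptBench's API.

def load_dataset():
    # Stand-in dataset loader; a real pipeline would pull e.g. SST-2 from GLUE.
    return [{"text": "a gorgeous film", "label": 1},
            {"text": "a tedious mess", "label": 0}]

def build_prompt(template, example):
    # Prompt construction: fill a task template with the input text.
    return template.format(text=example["text"])

def mock_model(prompt):
    # Stand-in for an LLM call; classifies by a trivial keyword rule.
    return "positive" if "gorgeous" in prompt else "negative"

def evaluate(dataset, template):
    # Analysis step: map model outputs back to labels and compute accuracy.
    label_map = {"positive": 1, "negative": 0}
    correct = sum(
        label_map[mock_model(build_prompt(template, ex))] == ex["label"]
        for ex in dataset
    )
    return correct / len(dataset)

accuracy = evaluate(load_dataset(), "Classify the sentiment of: {text}\nAnswer:")
print(accuracy)
```

Because the model stub is deterministic, the loop also shows where the library's extension points sit: swapping `mock_model` for a real model client, or `build_prompt` for a different template, changes one function without touching the rest of the pipeline.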
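
One listed component is the adversarial prompt attack, which perturbs prompts to probe model robustness. Below is a minimal, self-contained sketch of one such perturbation (a character-level adjacent swap); the `char_swap` helper is hypothetical and is only a toy stand-in for the attacks the library actually implements.

```python
import random

def char_swap(prompt, rate=0.1, seed=0):
    # Character-level perturbation: swap adjacent alphabetic characters
    # at the given rate. Hypothetical helper, not PromptBench's implementation.
    rng = random.Random(seed)
    chars = list(prompt)
    i = 0
    while i < len(chars) - 1:
        if chars[i].isalpha() and chars[i + 1].isalpha() and rng.random() < rate:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
            i += 2  # skip past the swapped pair
        else:
            i += 1
    return "".join(chars)

perturbed = char_swap("Classify the sentiment of the sentence.")
print(perturbed)
```

A fixed seed keeps the perturbation reproducible, so the same attacked prompt can be re-evaluated across models; the swap preserves prompt length and word boundaries, altering only character order within words.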