On the Worst Prompt Performance of Large Language Models

Authors: Bowen Cao, Deng Cai, Zhisong Zhang, Yuexian Zou, Wai Lam

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments on RobustAlpacaEval with ChatGPT and six open-source LLMs from the Llama, Mistral, and Gemma families uncover substantial variability in model performance.
Researcher Affiliation | Collaboration | Bowen Cao, Deng Cai, Zhisong Zhang, Yuexian Zou, Wai Lam; The Chinese University of Hong Kong, Tencent AI Lab, Peking University
Pseudocode | No | The paper describes methods with mathematical formulas but does not include structured pseudocode or algorithm blocks.
Open Source Code | Yes | Data and code are available at https://github.com/bwcao/RobustAlpacaEval.
Open Datasets | Yes | Our benchmark is based on TinyAlpacaEval (Polo et al., 2024), which is a condensed subset of the AlpacaEval (Li et al., 2023) benchmark.
Dataset Splits | Yes | We implement two training-test set partitioning strategies: (i) Intra: Dividing prompts within each case into training and testing sets at a 3:1 ratio. (ii) Inter: Dividing all cases into training and testing sets at a 3:1 ratio. (See the partitioning sketch after this table.)
Hardware Specification | No | The paper discusses the models used (ChatGPT, Llama, Mistral, Gemma families) but does not specify the hardware (e.g., GPU/CPU models, memory) used to run the experiments. The NeurIPS checklist confirms this, stating, 'Given that our experiments do not require substantial computational resources, we have not specifically outlined the computer resources needed for the experiments.'
Software Dependencies | No | The paper does not provide specific version numbers for any software dependencies or libraries used in the implementation of the experiments.
Experiment Setup | Yes | We train a reward model (a 3-layer MLP in practice)... We train models based on LoRA and stop training at the checkpoint from which C(X, Y) starts to increase. (See the reward-model sketch after this table.)
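
The two partitioning strategies quoted in the Dataset Splits row can be made concrete with a short sketch. The helper below is illustrative only: the `cases` layout (a case id mapped to its list of paraphrased prompts), the function name `split_cases`, and the fixed seed are assumptions rather than details taken from the released code; only the Intra/Inter distinction and the 3:1 ratio come from the paper.

```python
import random


def split_cases(cases, strategy="intra", ratio=0.75, seed=0):
    """Sketch of the paper's two 3:1 train/test partitioning strategies.

    `cases` is assumed to map a case id to its list of paraphrased prompts;
    this data layout is an assumption, not taken from the released code.
    """
    rng = random.Random(seed)
    train, test = {}, {}
    if strategy == "intra":
        # Intra: split the prompts *within* each case 3:1.
        for case_id, prompts in cases.items():
            shuffled = prompts[:]
            rng.shuffle(shuffled)
            cut = int(len(shuffled) * ratio)
            train[case_id] = shuffled[:cut]
            test[case_id] = shuffled[cut:]
    elif strategy == "inter":
        # Inter: split the *cases themselves* 3:1, keeping each case intact.
        case_ids = list(cases)
        rng.shuffle(case_ids)
        cut = int(len(case_ids) * ratio)
        train = {cid: cases[cid] for cid in case_ids[:cut]}
        test = {cid: cases[cid] for cid in case_ids[cut:]}
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return train, test
```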
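
The Experiment Setup row mentions a reward model implemented as a 3-layer MLP, with LoRA-based training that is stopped once C(X, Y) starts to increase. A minimal PyTorch sketch of such a reward head is given below, assuming the MLP scores a pooled representation of a prompt-response pair; the input dimension, hidden width, and pooled-feature interface are assumptions, since the paper only specifies the 3-layer MLP structure.

```python
import torch.nn as nn


class RewardMLP(nn.Module):
    """3-layer MLP reward head (a sketch; dimensions are assumptions)."""

    def __init__(self, in_dim=4096, hidden_dim=1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),  # scalar reward per prompt-response pair
        )

    def forward(self, features):
        # `features` is assumed to be a pooled representation of the
        # prompt-response pair (e.g., a hidden state from the LLM).
        return self.net(features).squeeze(-1)
```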