On the Worst Prompt Performance of Large Language Models

Authors: Bowen Cao, Deng Cai, Zhisong Zhang, Yuexian Zou, Wai Lam

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments on RobustAlpacaEval with ChatGPT and six open-source LLMs from the Llama, Mistral, and Gemma families uncover substantial variability in model performance.
Researcher Affiliation | Collaboration | Bowen Cao, Deng Cai, Zhisong Zhang, Yuexian Zou, Wai Lam; The Chinese University of Hong Kong, Tencent AI Lab, Peking University
Pseudocode | No | The paper describes methods with mathematical formulas but does not include structured pseudocode or algorithm blocks.
Open Source Code | Yes | Data and code are available at https://github.com/bwcao/RobustAlpacaEval.
Open Datasets | Yes | Our benchmark is based on TinyAlpacaEval (Polo et al., 2024), which is a condensed subset of the AlpacaEval (Li et al., 2023) benchmark.
Dataset Splits | Yes | We implement two training-test set partitioning strategies: (i) Intra: Dividing prompts within each case into training and testing sets at a 3:1 ratio. (ii) Inter: Dividing all cases into training and testing sets at a 3:1 ratio. (See the partitioning sketch after this table.)
Hardware Specification | No | The paper discusses the models used (ChatGPT, Llama, Mistral, Gemma families) but does not specify the hardware (e.g., GPU/CPU models, memory) used to run the experiments. The NeurIPS checklist confirms this, stating, 'Given that our experiments do not require substantial computational resources, we have not specifically outlined the computer resources needed for the experiments.'
Software Dependencies | No | The paper does not provide specific version numbers for any software dependencies or libraries used in the implementation of the experiments.
Experiment Setup | Yes | We train a reward model (a 3-layer MLP in practice)... We train models based on LoRA and stop training at the checkpoint from which C(X, Y) starts to increase. (See the reward-model sketch after this table.)
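
The two partitioning strategies quoted in the Dataset Splits row can be made concrete with a short sketch. The helper below is illustrative only: the `cases` layout (a case id mapped to its list of paraphrased prompts), the function name `split_cases`, and the fixed seed are assumptions rather than details taken from the released code; only the Intra/Inter distinction and the 3:1 ratio come from the paper.

```python
import random


def split_cases(cases, strategy="intra", ratio=0.75, seed=0):
    """Sketch of the paper's two 3:1 train/test partitioning strategies.

    `cases` is assumed to map a case id to its list of paraphrased prompts;
    this data layout is an assumption, not taken from the released code.
    """
    rng = random.Random(seed)
    train, test = {}, {}
    if strategy == "intra":
        # Intra: split the prompts *within* each case 3:1.
        for case_id, prompts in cases.items():
            shuffled = prompts[:]
            rng.shuffle(shuffled)
            cut = int(len(shuffled) * ratio)
            train[case_id] = shuffled[:cut]
            test[case_id] = shuffled[cut:]
    elif strategy == "inter":
        # Inter: split the *cases themselves* 3:1, keeping each case intact.
        case_ids = list(cases)
        rng.shuffle(case_ids)
        cut = int(len(case_ids) * ratio)
        train = {cid: cases[cid] for cid in case_ids[:cut]}
        test = {cid: cases[cid] for cid in case_ids[cut:]}
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return train, test
```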
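
The Experiment Setup row mentions a reward model implemented as a 3-layer MLP, with LoRA-based training that is stopped once C(X, Y) starts to increase. A minimal PyTorch sketch of such a reward head is given below, assuming the MLP scores a pooled representation of a prompt-response pair; the input dimension, hidden width, and pooled-feature interface are assumptions, since the paper only specifies the 3-layer MLP structure.

```python
import torch.nn as nn


class RewardMLP(nn.Module):
    """3-layer MLP reward head (a sketch; dimensions are assumptions)."""

    def __init__(self, in_dim=4096, hidden_dim=1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),  # scalar reward per prompt-response pair
        )

    def forward(self, features):
        # `features` is assumed to be a pooled representation of the
        # prompt-response pair (e.g., a hidden state from the LLM).
        return self.net(features).squeeze(-1)
```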