Large Language Models Are Not Robust Multiple Choice Selectors

Authors: Chujie Zheng, Hao Zhou, Fandong Meng, Jie Zhou, Minlie Huang

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Through extensive empirical analyses with 20 LLMs on three benchmarks, we pinpoint that this behavioral bias primarily stems from LLMs' token bias, where the model a priori assigns more probabilistic mass to specific option ID tokens (e.g., A/B/C/D) when predicting answers from the option IDs. We conduct experiments on MMLU (Hendrycks et al., 2020), ARC-Challenge (Clark et al., 2018), and CommonsenseQA (CSQA) (Talmor et al., 2019), which are all MCQ benchmarks widely used for LLM evaluation.
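The token bias described above can be probed directly from the next-token distribution at the answer position. The following is a minimal sketch (not the authors' code) that reads off the probability mass an open-source causal LM assigns to the option ID tokens A/B/C/D for a single prompt; the paper aggregates such measurements over many samples and option permutations. The model name and prompt template here are illustrative assumptions.

```python
# Sketch: probing an LLM's prior preference for option ID tokens (A/B/C/D)
# with Hugging Face transformers. Model name and prompt are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"  # assumed model; any causal LM works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
)

prompt = (
    "Question: Which planet is known as the Red Planet?\n"
    "A. Venus\nB. Mars\nC. Jupiter\nD. Saturn\n"
    "Answer:"
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    logits = model(**inputs).logits[0, -1]  # next-token logits at the answer position
probs = torch.softmax(logits, dim=-1)

# Probability mass placed on each option ID token.
for option in ["A", "B", "C", "D"]:
    token_id = tokenizer.encode(" " + option, add_special_tokens=False)[0]
    print(option, probs[token_id].item())
```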
Researcher Affiliation | Collaboration | Chujie Zheng, Hao Zhou, Fandong Meng, Jie Zhou, Minlie Huang. The CoAI Group, DCST, BNRist, Tsinghua University, Beijing 100084, China; Pattern Recognition Center, WeChat AI, Tencent Inc., China.
Pseudocode | Yes | Algorithm 1 PriDe: Debiasing with Prior Estimation
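The algorithm box is summarized in the paper as prior estimation followed by debiasing. The sketch below illustrates that idea under stated assumptions and is not the authors' reference implementation: the prior over option ID positions is estimated by averaging log prediction probabilities over cyclic permutations of the option contents on a small estimation set D_e, and that prior is then divided out of the observed prediction distribution. `get_option_probs(question, options)` is a hypothetical helper returning the model's probability distribution over option IDs for options presented in the given order.

```python
# Sketch of the PriDe idea (prior estimation + debiasing); not the reference code.
import numpy as np

def cyclic_permutations(n):
    """All cyclic shifts of positions 0..n-1."""
    return [[(i + s) % n for i in range(n)] for s in range(n)]

def estimate_prior(estimation_set, get_option_probs, n_options=4):
    """Estimate the model's prior over option ID positions on a small sample D_e."""
    log_prior = np.zeros(n_options)
    count = 0
    for question, options in estimation_set:
        for perm in cyclic_permutations(n_options):
            permuted = [options[j] for j in perm]
            probs = get_option_probs(question, permuted)      # observed P over option IDs
            log_prior += np.log(np.clip(probs, 1e-12, None))  # average in log space
            count += 1
    log_prior /= count
    prior = np.exp(log_prior)
    return prior / prior.sum()

def debias(question, options, prior, get_option_probs):
    """Divide the estimated ID prior out of the observed prediction distribution."""
    observed = np.array(get_option_probs(question, options))
    debiased = observed / prior
    return debiased / debiased.sum()
```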
Open Source Code | Yes | Project repository: https://github.com/chujiezheng/LLM-MCQ-Bias. We have released the evaluation data, code, and experimental results at https://github.com/chujiezheng/LLM-MCQ-Bias to facilitate reproducible research.
Open Datasets | Yes | Benchmarks: We conduct experiments on MMLU (Hendrycks et al., 2020), ARC-Challenge (Clark et al., 2018), and CommonsenseQA (CSQA) (Talmor et al., 2019), which are all MCQ benchmarks widely used for LLM evaluation. MMLU: https://github.com/hendrycks/test; ARC: https://allenai.org/data/arc; CSQA: https://allenai.org/data/commonsenseqa
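The paper links the original distribution pages for the three benchmarks. As a convenience, the sketch below loads them through the Hugging Face `datasets` hub; the hub IDs and split choices are assumptions rather than details taken from the paper.

```python
# Sketch: loading the three MCQ benchmarks from the Hugging Face hub.
# Hub dataset IDs are assumed; the paper points to the original distribution pages.
from datasets import load_dataset

mmlu = load_dataset("cais/mmlu", "all", split="test")
arc_challenge = load_dataset("ai2_arc", "ARC-Challenge", split="test")
csqa = load_dataset("commonsense_qa", split="validation")  # public labels; test labels are hidden

print(len(mmlu), len(arc_challenge), len(csqa))
```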
Dataset Splits | No | Our evaluation mainly considers the 0-shot setting, which excludes biases introduced by in-context examples, but we also conduct 5-shot experiments. The in-context examples come from the development sets and are shared across all the test samples within the same task. This describes the evaluation setup (0-shot/5-shot) and the use of development sets for in-context examples, but it does not specify a training/validation/test split for reproducing model training or hyperparameter tuning; the authors evaluate pre-trained LLMs.
Hardware Specification | Yes | Our experiments were run on A100 40GB GPUs (for 70B models) and V100 32GB (for other models).
Software Dependencies | No | The paper references frameworks and APIs such as the Hugging Face LLM Leaderboard, EleutherAI lm-harness, and OpenAI Evals, but does not provide specific version numbers for these or for any other software libraries or programming languages used in the experimental setup.
Experiment Setup | Yes | Our evaluation mainly considers the 0-shot setting, which excludes biases introduced by in-context examples, but we also conduct 5-shot experiments. For gpt-3.5-turbo, we compare the golden answer with the first generated token, with the decoding temperature set to 0. For PriDe, we randomly sample K = α|D| test samples as D_e and report the average results over 5 runs.
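For the gpt-3.5-turbo setting quoted above (0-shot, temperature 0, first generated token compared with the golden answer), a minimal sketch follows using the OpenAI Python client. The exact client interface varies across library versions, and the prompt template is illustrative rather than the paper's.

```python
# Sketch of the gpt-3.5-turbo evaluation setting: 0-shot prompt, temperature 0,
# comparing the golden answer ID with the first generated token.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def answer_mcq(question, options, gold_id):
    prompt = question + "\n" + "\n".join(
        f"{chr(ord('A') + i)}. {opt}" for i, opt in enumerate(options)
    ) + "\nAnswer:"
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
        max_tokens=1,  # only the first generated token is compared with the gold ID
    )
    predicted = response.choices[0].message.content.strip()[:1]
    return predicted == gold_id

print(answer_mcq("Which planet is known as the Red Planet?",
                 ["Venus", "Mars", "Jupiter", "Saturn"], "B"))
```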