Large Language Models Are Not Robust Multiple Choice Selectors
Authors: Chujie Zheng, Hao Zhou, Fandong Meng, Jie Zhou, Minlie Huang
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through extensive empirical analyses with 20 LLMs on three benchmarks, we pinpoint that this behavioral bias primarily stems from LLMs' token bias, where the model a priori assigns more probabilistic mass to specific option ID tokens (e.g., A/B/C/D) when predicting answers from the option IDs. We conduct experiments on MMLU (Hendrycks et al., 2020), ARC-Challenge (Clark et al., 2018), and CommonsenseQA (CSQA) (Talmor et al., 2019), which are all MCQ benchmarks widely used for LLM evaluation. |
| Researcher Affiliation | Collaboration | Chujie Zheng, Hao Zhou, Fandong Meng, Jie Zhou, Minlie Huang; The CoAI Group, DCST, BNRist, Tsinghua University, Beijing 100084, China; Pattern Recognition Center, WeChat AI, Tencent Inc., China |
| Pseudocode | Yes | Algorithm 1 PriDe: Debiasing with Prior Estimation (see the sketch after this table) |
| Open Source Code | Yes | Project repository: https://github.com/chujiezheng/LLM-MCQ-Bias. We have released the evaluation data, code, and experimental results at https://github.com/chujiezheng/LLM-MCQ-Bias to facilitate reproducible research. |
| Open Datasets | Yes | Benchmarks: We conduct experiments on MMLU (Hendrycks et al., 2020), ARC-Challenge (Clark et al., 2018), and CommonsenseQA (CSQA) (Talmor et al., 2019), which are all MCQ benchmarks widely used for LLM evaluation. MMLU: https://github.com/hendrycks/test; ARC: https://allenai.org/data/arc; CSQA: https://allenai.org/data/commonsenseqa |
| Dataset Splits | No | Our evaluation mainly considers the 0-shot setting, which excludes biases introduced by in-context examples, but we also conduct 5-shot experiments. The in-context examples come from the development sets and are shared across all the test samples within the same task. This describes the evaluation setup (0-shot/5-shot) and the use of development sets for in-context examples, but it does not specify a training/validation/test split, since the authors evaluate pre-trained LLMs without further training or hyperparameter tuning. |
| Hardware Specification | Yes | Our experiments were run on A100 40GB GPUs (for 70B models) and V100 32GB (for other models). |
| Software Dependencies | No | The paper references frameworks and APIs like the Hugging Face LLM Leaderboard, EleutherAI lm-harness, and OpenAI Evals, but does not provide specific version numbers for these or any other software libraries or programming languages used in the experimental setup. |
| Experiment Setup | Yes | Our evaluation mainly considers the 0-shot setting, which excludes biases introduced by in-context examples, but we also conduct 5-shot experiments. For gpt-3.5-turbo, we compare the golden answer with the first generated token, with the decoding temperature set to 0 (see the evaluation sketch after this table). For PriDe, we randomly sample K = α|D| test samples as D_e and report the average results over 5 runs. |
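
The Pseudocode row above refers to PriDe (Debiasing with Prior Estimation). The following is a minimal sketch of that procedure under stated assumptions: it presumes a hypothetical `get_option_probs(question, options)` helper that returns the model's normalized probabilities over the option ID tokens (A/B/C/D) for the options presented in the given order. It is an illustration, not the authors' released implementation (see the project repository for that).

```python
# Minimal PriDe-style sketch (hypothetical helper names, not the authors' code):
# 1) estimate the prior over option IDs by cycling option contents on a small estimation set;
# 2) debias new predictions by dividing observed option-ID probabilities by that prior.
import numpy as np

def estimate_prior(estimation_samples, get_option_probs, n_options=4):
    """estimation_samples: iterable of (question, options) pairs, K = alpha * |D| of them."""
    log_prior = np.zeros(n_options)
    count = 0
    for question, options in estimation_samples:
        for shift in range(n_options):
            permuted = options[shift:] + options[:shift]      # cyclic permutation of option contents
            probs = np.asarray(get_option_probs(question, permuted))
            log_prior += np.log(probs + 1e-12)                # averaging in log space cancels the content-dependent term
            count += 1
    prior = np.exp(log_prior / count)
    return prior / prior.sum()                                # normalized prior over option IDs A/B/C/D

def debiased_prediction(question, options, prior, get_option_probs):
    """Return the index of the predicted option after dividing out the estimated prior."""
    observed = np.asarray(get_option_probs(question, options))
    return int(np.argmax(observed / prior))
```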
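
For the gpt-3.5-turbo setting quoted in the Experiment Setup row, comparing the golden answer with the first generated token at temperature 0 could look roughly like the sketch below. It uses the OpenAI Python client (openai>=1.0); the prompt wording and answer parsing are assumptions for illustration, not taken from the paper.

```python
# Hypothetical evaluation sketch: greedy decoding (temperature 0), one generated token,
# compared against the golden option ID.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def predict_option_id(question: str, options: list[str]) -> str:
    option_ids = "ABCD"[: len(options)]
    lines = [question] + [f"{oid}. {opt}" for oid, opt in zip(option_ids, options)]
    prompt = "\n".join(lines) + "\nAnswer:"
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,   # deterministic decoding, as in the quoted setup
        max_tokens=1,    # only the first generated token is kept
    )
    return response.choices[0].message.content.strip()

def is_correct(prediction: str, golden_id: str) -> bool:
    # The paper compares the golden answer with the first generated token.
    return prediction.upper().startswith(golden_id.upper())
```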