Large Language Models Are Not Robust Multiple Choice Selectors

Authors: Chujie Zheng, Hao Zhou, Fandong Meng, Jie Zhou, Minlie Huang

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Through extensive empirical analyses with 20 LLMs on three benchmarks, we pinpoint that this behavioral bias primarily stems from LLMs' token bias, where the model a priori assigns more probabilistic mass to specific option ID tokens (e.g., A/B/C/D) when predicting answers from the option IDs. We conduct experiments on MMLU (Hendrycks et al., 2020), ARC-Challenge (Clark et al., 2018), and CommonsenseQA (CSQA) (Talmor et al., 2019), which are all MCQ benchmarks widely used for LLM evaluation.
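The token bias described above can be probed directly from the next-token distribution at the answer position. The following is a minimal sketch (not the authors' code) that reads off the probability mass an open-source causal LM assigns to the option ID tokens A/B/C/D for a single prompt; the paper aggregates such measurements over many samples and option permutations. The model name and prompt template here are illustrative assumptions.

```python
# Sketch: probing an LLM's prior preference for option ID tokens (A/B/C/D)
# with Hugging Face transformers. Model name and prompt are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"  # assumed model; any causal LM works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
)

prompt = (
    "Question: Which planet is known as the Red Planet?\n"
    "A. Venus\nB. Mars\nC. Jupiter\nD. Saturn\n"
    "Answer:"
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    logits = model(**inputs).logits[0, -1]  # next-token logits at the answer position
probs = torch.softmax(logits, dim=-1)

# Probability mass placed on each option ID token.
for option in ["A", "B", "C", "D"]:
    token_id = tokenizer.encode(" " + option, add_special_tokens=False)[0]
    print(option, probs[token_id].item())
```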
Researcher Affiliation | Collaboration | Chujie Zheng, Hao Zhou, Fandong Meng, Jie Zhou, Minlie Huang. The CoAI Group, DCST, BNRist, Tsinghua University, Beijing 100084, China; Pattern Recognition Center, WeChat AI, Tencent Inc., China.
Pseudocode | Yes | Algorithm 1 PriDe: Debiasing with Prior Estimation
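The algorithm box is summarized in the paper as prior estimation followed by debiasing. The sketch below illustrates that idea under stated assumptions and is not the authors' reference implementation: the prior over option ID positions is estimated by averaging log prediction probabilities over cyclic permutations of the option contents on a small estimation set D_e, and that prior is then divided out of the observed prediction distribution. `get_option_probs(question, options)` is a hypothetical helper returning the model's probability distribution over option IDs for options presented in the given order.

```python
# Sketch of the PriDe idea (prior estimation + debiasing); not the reference code.
import numpy as np

def cyclic_permutations(n):
    """All cyclic shifts of positions 0..n-1."""
    return [[(i + s) % n for i in range(n)] for s in range(n)]

def estimate_prior(estimation_set, get_option_probs, n_options=4):
    """Estimate the model's prior over option ID positions on a small sample D_e."""
    log_prior = np.zeros(n_options)
    count = 0
    for question, options in estimation_set:
        for perm in cyclic_permutations(n_options):
            permuted = [options[j] for j in perm]
            probs = get_option_probs(question, permuted)      # observed P over option IDs
            log_prior += np.log(np.clip(probs, 1e-12, None))  # average in log space
            count += 1
    log_prior /= count
    prior = np.exp(log_prior)
    return prior / prior.sum()

def debias(question, options, prior, get_option_probs):
    """Divide the estimated ID prior out of the observed prediction distribution."""
    observed = np.array(get_option_probs(question, options))
    debiased = observed / prior
    return debiased / debiased.sum()
```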
Open Source Code | Yes | Project repository: https://github.com/chujiezheng/LLM-MCQ-Bias. We have released the evaluation data, code, and experimental results at https://github.com/chujiezheng/LLM-MCQ-Bias to facilitate reproducible research.
Open Datasets | Yes | Benchmarks: We conduct experiments on MMLU (Hendrycks et al., 2020), ARC-Challenge (Clark et al., 2018), and CommonsenseQA (CSQA) (Talmor et al., 2019), which are all MCQ benchmarks widely used for LLM evaluation. MMLU: https://github.com/hendrycks/test; ARC: https://allenai.org/data/arc; CSQA: https://allenai.org/data/commonsenseqa
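The paper links the original distribution pages for the three benchmarks. As a convenience, the sketch below loads them through the Hugging Face `datasets` hub; the hub IDs and split choices are assumptions rather than details taken from the paper.

```python
# Sketch: loading the three MCQ benchmarks from the Hugging Face hub.
# Hub dataset IDs are assumed; the paper points to the original distribution pages.
from datasets import load_dataset

mmlu = load_dataset("cais/mmlu", "all", split="test")
arc_challenge = load_dataset("ai2_arc", "ARC-Challenge", split="test")
csqa = load_dataset("commonsense_qa", split="validation")  # public labels; test labels are hidden

print(len(mmlu), len(arc_challenge), len(csqa))
```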
Dataset Splits | No | Our evaluation mainly considers the 0-shot setting, which excludes biases introduced by in-context examples, but we also conduct 5-shot experiments. The in-context examples come from the development sets and are shared across all the test samples within the same task. This describes the evaluation setup (0-shot/5-shot) and the use of development sets for in-context examples, but it does not specify a training/validation/test split for reproducing model training or hyperparameter tuning; the authors evaluate pre-trained LLMs.
Hardware Specification | Yes | Our experiments were run on A100 40GB GPUs (for 70B models) and V100 32GB (for other models).
Software Dependencies | No | The paper references frameworks and APIs such as the Hugging Face LLM Leaderboard, EleutherAI lm-harness, and OpenAI Evals, but does not provide specific version numbers for these or for any other software libraries or programming languages used in the experimental setup.
Experiment Setup | Yes | Our evaluation mainly considers the 0-shot setting, which excludes biases introduced by in-context examples, but we also conduct 5-shot experiments. For gpt-3.5-turbo, we compare the golden answer with the first generated token, with the decoding temperature set to 0. For PriDe, we randomly sample K = α|D| test samples as D_e and report the average results over 5 runs.
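For the gpt-3.5-turbo setting quoted above (0-shot, temperature 0, first generated token compared with the golden answer), a minimal sketch follows using the OpenAI Python client. The exact client interface varies across library versions, and the prompt template is illustrative rather than the paper's.

```python
# Sketch of the gpt-3.5-turbo evaluation setting: 0-shot prompt, temperature 0,
# comparing the golden answer ID with the first generated token.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def answer_mcq(question, options, gold_id):
    prompt = question + "\n" + "\n".join(
        f"{chr(ord('A') + i)}. {opt}" for i, opt in enumerate(options)
    ) + "\nAnswer:"
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
        max_tokens=1,  # only the first generated token is compared with the gold ID
    )
    predicted = response.choices[0].message.content.strip()[:1]
    return predicted == gold_id

print(answer_mcq("Which planet is known as the Red Planet?",
                 ["Venus", "Mars", "Jupiter", "Saturn"], "B"))
```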