Leveraging Large Language Models for Multiple Choice Question Answering

Authors: Joshua Robinson, David Wingate

ICLR 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We show that a model with high MCSB ability performs much better with the natural approach than with the traditional approach across 20 diverse datasets and largely closes the gap with the SOTA, suggesting that the MCQA ability of LLMs has been previously underestimated. We evaluate the performance of a model with strong MCSB ability and multiple choice prompts across a set of 20 diverse datasets. (MCSB refers to multiple choice symbol binding, the paper's term for a model's ability to associate answer options with the symbols that label them.)
Researcher Affiliation | Academia | Joshua Robinson & David Wingate, Department of Computer Science, Brigham Young University; joshua robinson@byu.edu, wingated@cs.byu.edu. Work done while at Brigham Young University. Now at University of Southern California.
Pseudocode | No | The paper describes methods textually and uses figures for visualization, but does not include structured pseudocode or algorithm blocks.
Open Source Code | Yes | Code is available at https://github.com/BYU-PCCL/leveraging-llms-for-mcqa.
Open Datasets | Yes | We evaluate the performance of a model with strong MCSB ability and multiple choice prompts across a set of 20 diverse datasets. Examples of questions from each of these datasets can be found in Appendix A. (The paper lists 20 datasets with citations, e.g., 'ARC (Clark et al., 2018)', indicating they are publicly available.)
Dataset Splits | Yes | We evaluate 5-shot model performance on a commonsense reasoning dataset (OpenBookQA (Mihaylov et al., 2018)), a cloze/completion dataset (Story Cloze (Mostafazadeh et al., 2016)), and a reading comprehension dataset (RACE-m (Lai et al., 2017)). We randomly sample 500 instances for both Story Cloze and RACE-m to reduce computational costs. K is always chosen to be as high as possible while respecting Codex’s 4,000 token context limit. (A sketch of assembling k-shot prompts under this token budget follows the table.)
Hardware Specification | No | The paper mentions using 'OpenAI Codex beta' and 'API requests' and discusses 'computational cost' and the time taken for some computations ('took over a week'), but it does not specify the underlying hardware (e.g., GPU models, CPU types) used for these operations or for the experiments.
Software Dependencies | No | The paper mentions using 'Hugging Face Transformers' and refers to model checkpoints and API endpoints, but it does not provide specific version numbers for any software dependencies.
Experiment Setup | Yes | K is always chosen to be as high as possible while respecting Codex’s 4,000 token context limit. Our prompt phrasing is consistent across tasks: We prefix the raw question with Question:, list answer options with associated letters (like A. Lollipop), and finish prompts with Answer:. We measure model probability for an answer via the probability of the symbol associated with it. (A minimal prompt-construction and letter-scoring sketch follows the table.)
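
To make the prompt format quoted in the Experiment Setup row concrete, here is a minimal sketch of multiple choice prompting with symbol-probability scoring. It assumes a Hugging Face causal language model (gpt2 is used only as a stand-in for the models evaluated in the paper), and the helper names build_prompt/predict plus the example question are hypothetical, not taken from the paper's code.

```python
# Sketch: format a question as "Question: ... / A. opt / B. opt / ... / Answer:"
# and score each option by the model's probability of emitting its letter symbol.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")   # stand-in model, not the paper's
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def build_prompt(question, options):
    """Prefix with 'Question:', list lettered options, end with 'Answer:'."""
    letters = [chr(ord("A") + i) for i in range(len(options))]
    lines = [f"Question: {question}"]
    lines += [f"{letter}. {opt}" for letter, opt in zip(letters, options)]
    lines.append("Answer:")
    return "\n".join(lines), letters

def predict(question, options):
    prompt, letters = build_prompt(question, options)
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(input_ids).logits[0, -1]      # next-token logits after "Answer:"
    log_probs = torch.log_softmax(logits, dim=-1)
    # Probability mass assigned to each answer letter (note the leading space, " A").
    letter_ids = [tokenizer.encode(f" {letter}")[0] for letter in letters]
    scores = {l: log_probs[i].item() for l, i in zip(letters, letter_ids)}
    return max(scores, key=scores.get), scores

pred, scores = predict(
    "What is a common treat on a stick?",            # hypothetical example question
    ["Broccoli", "Lollipop", "Gravel", "Soup"],
)
print(pred, scores)
```

The key point of the "natural" approach is that only one forward pass is needed per question, since all options appear in the prompt and the prediction is read off the distribution over answer letters.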
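
The Dataset Splits and Experiment Setup rows also state that k is chosen as high as possible while respecting Codex's 4,000-token context limit. Below is a minimal sketch of that greedy budget check, assuming a GPT-2 tokenizer as a stand-in for Codex's; CONTEXT_LIMIT, ANSWER_MARGIN, and assemble_k_shot_prompt are illustrative names, not the paper's.

```python
# Sketch: prepend as many few-shot exemplars as fit under a fixed token budget.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")   # assumed tokenizer, not Codex's
CONTEXT_LIMIT = 4000        # budget mirroring Codex's 4,000-token context limit
ANSWER_MARGIN = 8           # assumed headroom for the generated answer letter

def num_tokens(text):
    return len(tokenizer.encode(text))

def assemble_k_shot_prompt(exemplars, test_prompt):
    """Greedily add formatted exemplars (each already ending with its gold answer
    letter) until adding another would exceed the remaining token budget."""
    chosen = []
    budget = CONTEXT_LIMIT - ANSWER_MARGIN - num_tokens(test_prompt)
    for exemplar in exemplars:
        cost = num_tokens(exemplar + "\n\n")
        if cost > budget:
            break
        chosen.append(exemplar)
        budget -= cost
    # Return the full prompt and the resulting k (number of exemplars that fit).
    return "\n\n".join(chosen + [test_prompt]), len(chosen)
```

This mirrors the stated procedure only at a high level; the paper does not specify the exact exemplar ordering or separator strings, so those details here are assumptions.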