Leveraging Large Language Models for Multiple Choice Question Answering

Authors: Joshua Robinson, David Wingate

ICLR 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We show that a model with high MCSB ability performs much better with the natural approach than with the traditional approach across 20 diverse datasets and largely closes the gap with the SOTA, suggesting that the MCQA ability of LLMs has been previously underestimated. We evaluate the performance of a model with strong MCSB ability and multiple choice prompts across a set of 20 diverse datasets. (MCSB refers to multiple choice symbol binding, the paper's term for a model's ability to associate answer options with the symbols that label them.)
Researcher Affiliation | Academia | Joshua Robinson & David Wingate, Department of Computer Science, Brigham Young University; joshua robinson@byu.edu, wingated@cs.byu.edu. Work done while at Brigham Young University. Now at University of Southern California.
Pseudocode | No | The paper describes methods textually and uses figures for visualization, but does not include structured pseudocode or algorithm blocks.
Open Source Code | Yes | Code is available at https://github.com/BYU-PCCL/leveraging-llms-for-mcqa.
Open Datasets | Yes | We evaluate the performance of a model with strong MCSB ability and multiple choice prompts across a set of 20 diverse datasets. Examples of questions from each of these datasets can be found in Appendix A. (The paper lists 20 datasets with citations, e.g., 'ARC (Clark et al., 2018)', indicating they are publicly available.)
Dataset Splits | Yes | We evaluate 5-shot model performance on a commonsense reasoning dataset (OpenBookQA (Mihaylov et al., 2018)), a cloze/completion dataset (Story Cloze (Mostafazadeh et al., 2016)), and a reading comprehension dataset (RACE-m (Lai et al., 2017)). We randomly sample 500 instances for both Story Cloze and RACE-m to reduce computational costs. K is always chosen to be as high as possible while respecting Codex’s 4,000 token context limit. (A sketch of assembling k-shot prompts under this token budget follows the table.)
Hardware Specification | No | The paper mentions using 'OpenAI Codex beta' and 'API requests' and discusses 'computational cost' and the time taken for some computations ('took over a week'), but it does not specify the underlying hardware (e.g., GPU models, CPU types) used for these operations or for the experiments.
Software Dependencies | No | The paper mentions using 'Hugging Face Transformers' and refers to model checkpoints and API endpoints, but it does not provide specific version numbers for any software dependencies.
Experiment Setup | Yes | K is always chosen to be as high as possible while respecting Codex’s 4,000 token context limit. Our prompt phrasing is consistent across tasks: We prefix the raw question with Question:, list answer options with associated letters (like A. Lollipop), and finish prompts with Answer:. We measure model probability for an answer via the probability of the symbol associated with it. (A minimal prompt-construction and letter-scoring sketch follows the table.)
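
To make the prompt format quoted in the Experiment Setup row concrete, here is a minimal sketch of multiple choice prompting with symbol-probability scoring. It assumes a Hugging Face causal language model (gpt2 is used only as a stand-in for the models evaluated in the paper), and the helper names build_prompt/predict plus the example question are hypothetical, not taken from the paper's code.

```python
# Sketch: format a question as "Question: ... / A. opt / B. opt / ... / Answer:"
# and score each option by the model's probability of emitting its letter symbol.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")   # stand-in model, not the paper's
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def build_prompt(question, options):
    """Prefix with 'Question:', list lettered options, end with 'Answer:'."""
    letters = [chr(ord("A") + i) for i in range(len(options))]
    lines = [f"Question: {question}"]
    lines += [f"{letter}. {opt}" for letter, opt in zip(letters, options)]
    lines.append("Answer:")
    return "\n".join(lines), letters

def predict(question, options):
    prompt, letters = build_prompt(question, options)
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(input_ids).logits[0, -1]      # next-token logits after "Answer:"
    log_probs = torch.log_softmax(logits, dim=-1)
    # Probability mass assigned to each answer letter (note the leading space, " A").
    letter_ids = [tokenizer.encode(f" {letter}")[0] for letter in letters]
    scores = {l: log_probs[i].item() for l, i in zip(letters, letter_ids)}
    return max(scores, key=scores.get), scores

pred, scores = predict(
    "What is a common treat on a stick?",            # hypothetical example question
    ["Broccoli", "Lollipop", "Gravel", "Soup"],
)
print(pred, scores)
```

The key point of the "natural" approach is that only one forward pass is needed per question, since all options appear in the prompt and the prediction is read off the distribution over answer letters.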
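
The Dataset Splits and Experiment Setup rows also state that k is chosen as high as possible while respecting Codex's 4,000-token context limit. Below is a minimal sketch of that greedy budget check, assuming a GPT-2 tokenizer as a stand-in for Codex's; CONTEXT_LIMIT, ANSWER_MARGIN, and assemble_k_shot_prompt are illustrative names, not the paper's.

```python
# Sketch: prepend as many few-shot exemplars as fit under a fixed token budget.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")   # assumed tokenizer, not Codex's
CONTEXT_LIMIT = 4000        # budget mirroring Codex's 4,000-token context limit
ANSWER_MARGIN = 8           # assumed headroom for the generated answer letter

def num_tokens(text):
    return len(tokenizer.encode(text))

def assemble_k_shot_prompt(exemplars, test_prompt):
    """Greedily add formatted exemplars (each already ending with its gold answer
    letter) until adding another would exceed the remaining token budget."""
    chosen = []
    budget = CONTEXT_LIMIT - ANSWER_MARGIN - num_tokens(test_prompt)
    for exemplar in exemplars:
        cost = num_tokens(exemplar + "\n\n")
        if cost > budget:
            break
        chosen.append(exemplar)
        budget -= cost
    # Return the full prompt and the resulting k (number of exemplars that fit).
    return "\n\n".join(chosen + [test_prompt]), len(chosen)
```

This mirrors the stated procedure only at a high level; the paper does not specify the exact exemplar ordering or separator strings, so those details here are assumptions.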