Capturing Failures of Large Language Models via Human Cognitive Biases

Authors: Erik Jones, Jacob Steinhardt

NeurIPS 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Specifically, we use cognitive biases as motivation to (i) generate hypotheses for problems that models may have, and (ii) develop experiments that elicit these problems. Using code generation as a case study, we find that OpenAI's Codex errs predictably based on how the input prompt is framed, adjusts outputs towards anchors, and is biased towards outputs that mimic frequent training examples. We then use our framework to elicit high-impact errors such as incorrectly deleting files. Our results indicate that experimental methodology from cognitive science can help characterize how machine learning systems behave.
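
To make the framing experiments concrete, here is a minimal sketch of how a misleading framing line can be appended to a function prompt; the helper name and the example prompt are hypothetical illustrations, not code from the paper's repository.

```python
# Hypothetical sketch of a framing-effect probe: append a misleading but
# syntactically valid line (e.g. "assert False") to an otherwise normal
# function prompt, then check whether the model's completion degrades.

def add_framing_line(prompt: str, framing_line: str) -> str:
    """Append an indented framing line to a function-definition prompt."""
    return prompt + "    " + framing_line + "\n"

prompt = (
    "def add(a, b):\n"
    '    """Return the sum of a and b."""\n'
)
framed_prompt = add_framing_line(prompt, "assert False")
print(framed_prompt)
```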
Researcher Affiliation | Academia | Erik Jones, UC Berkeley (erjones@berkeley.edu); Jacob Steinhardt, UC Berkeley (jsteinhardt@berkeley.edu)
Pseudocode | No | The paper contains figures and examples of code, but no structured pseudocode or algorithm blocks are explicitly labeled or presented.
Open Source Code | Yes | Code for this paper is available at https://github.com/ejones313/codex-cog-biases.
Open Datasets | Yes | We use the HumanEval benchmark as a diverse source of normal prompts [Chen et al., 2021].
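
The paper does not state how the HumanEval prompts were loaded. A minimal sketch, assuming the Hugging Face datasets mirror openai_humaneval:

```python
# Minimal sketch, assuming the Hugging Face mirror "openai_humaneval";
# the paper itself does not specify how HumanEval prompts were loaded.
from datasets import load_dataset

humaneval = load_dataset("openai_humaneval", split="test")
print(len(humaneval))            # 164 problems
example = humaneval[0]
print(example["task_id"])        # e.g. "HumanEval/0"
print(example["prompt"][:200])   # function signature plus docstring
```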
Dataset Splits | No | The paper evaluates pre-trained models on prompts from benchmarks such as HumanEval. While HumanEval provides test cases for verifying the correctness of generated code, the paper does not specify any training/validation/test split of the prompts for its own experimental methodology.
Hardware Specification | No | The paper states that Codex is queried through the OpenAI API and that CodeGen inference is 'run locally', but gives no details about the hardware (e.g., CPU/GPU models, memory) used for the local inference or for querying the API.
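
For context, the following is a minimal sketch of local CodeGen inference with Hugging Face transformers; the checkpoint Salesforce/codegen-2B-mono is an assumption, since the paper does not state which CodeGen variant was used.

```python
# Minimal sketch of local CodeGen inference with Hugging Face transformers.
# The checkpoint ("Salesforce/codegen-2B-mono") is an assumption; the paper
# does not say which CodeGen size or what hardware was used.
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "Salesforce/codegen-2B-mono"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)

prompt = 'def hello():\n    """Print a greeting."""\n'
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=False)  # greedy
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```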
Software Dependencies | No | The paper mentions OpenAI's Codex (davinci-001) and Salesforce's CodeGen, but does not provide version numbers for any ancillary software dependencies or libraries used to run the experiments locally.
Experiment Setup | Yes | We use the OpenAI API to query the davinci-001 version of Codex, and use greedy decoding to generate solutions. We test five framing lines: raise NotImplementedError, pass, assert False, return False, and print("Hello world!"). We consider all 12 combinations of the binary operations sum, difference, and product with the unary operations square, cube, quadruple, and square root.
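
A sketch of this querying setup, assuming the legacy openai-python (<1.0) Completion API and the engine name code-davinci-001; Codex endpoints have since been retired, so this is illustrative rather than a drop-in reproduction.

```python
# Sketch of the querying setup: greedy decoding via temperature=0, using the
# legacy openai-python (<1.0) Completion API. The engine name
# "code-davinci-001" is an assumption; Codex endpoints are now retired.
import openai

def complete(prompt: str, max_tokens: int = 256) -> str:
    response = openai.Completion.create(
        engine="code-davinci-001",
        prompt=prompt,
        temperature=0.0,  # greedy decoding
        max_tokens=max_tokens,
    )
    return response["choices"][0]["text"]

# The 12 prompt variants pair each binary operation with each unary one;
# the natural-language phrasing used in the actual prompts is not shown here.
BINARY_OPS = ["sum", "difference", "product"]
UNARY_OPS = ["square", "cube", "quadruple", "square root"]
combinations = [(b, u) for b in BINARY_OPS for u in UNARY_OPS]
assert len(combinations) == 12
```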