Evaluating the Moral Beliefs Encoded in LLMs
Authors: Nino Scherrer, Claudia Shi, Amir Feder, David Blei
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | This paper presents a case study on the design, administration, post-processing, and evaluation of surveys on large language models (LLMs). It comprises two components: (1) A statistical method for eliciting beliefs encoded in LLMs. We introduce statistical measures and evaluation metrics that quantify the probability of an LLM "making a choice", the associated uncertainty, and the consistency of that choice. (2) We apply this method to study what moral beliefs are encoded in different LLMs, especially in ambiguous cases where the right choice is not obvious. We design a large-scale survey comprising 680 high-ambiguity moral scenarios (...) We administer the survey to 28 open- and closed-source LLMs. (A minimal code sketch of these measures follows the table.) |
| Researcher Affiliation | Collaboration | Nino Scherrer (1), Claudia Shi (1,2), Amir Feder (2), and David M. Blei (2). Affiliations: (1) FAR AI, (2) Columbia University. |
| Pseudocode | No | The paper describes methods using mathematical definitions and equations (e.g., Definition 1, Eqs. 1–8) but does not present any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code and data are publicly available; the paper's footnote points to https://github.com/ninodimontalcino/moralchoice. |
| Open Datasets | Yes | MoralChoice, a survey dataset containing 1767 moral scenarios and responses from 28 open- and closed-source LLMs. Code and data are publicly available at https://github.com/ninodimontalcino/moralchoice. |
| Dataset Splits | No | The paper describes two sets of scenarios: low-ambiguity (687 scenarios) and high-ambiguity (680 scenarios), which are used as the full evaluation datasets. It does not define train, validation, or test splits of these datasets because the paper evaluates pre-trained LLMs rather than training new models for which such splits would be necessary. |
| Hardware Specification | No | The paper details the LLMs used in the study, including their access types (API or Hugging Face Hub) and parameter sizes (Table 1, Table 14). However, it does not provide specific hardware specifications (e.g., GPU models, CPU types, or memory) for the machines on which the experiments were conducted or the API calls were made. |
| Software Dependencies | No | The paper refers to various LLMs and their frameworks (e.g., OpenAI's GPT-4, Anthropic's Claude, Google's PaLM 2, Hugging Face models like Flan-T5, BLOOMZ, OPT-IML). However, it does not explicitly list specific software dependencies with version numbers (e.g., Python version, library versions like PyTorch or TensorFlow) required for reproducibility. |
| Experiment Setup | Yes | The sampling is performed using a temperature of 1, which controls the randomness of the LLM's responses. We then employ an iterative rule-based mapping procedure to map from sequences to actions. The details of the mapping are provided in Appendix B.2. For high-ambiguity scenarios, we set M to 10, while for low-ambiguity scenarios, we set M to 5. We assign equal weights to each question template. (A hedged sketch of this sampling-and-mapping loop also follows the table.) |
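
As context for the "Research Type" row above: the paper's measures boil down to summarizing a set of parsed model responses per scenario. Below is a minimal Python sketch of that style of measure; the function names and exact estimators (`action_likelihood`, `entropy`, `consistency`) are illustrative assumptions for this report, not the authors' implementation (which lives in the linked repository).

```python
import math
from collections import Counter

def action_likelihood(responses):
    """Empirical distribution over parsed action labels from M samples."""
    counts = Counter(responses)
    total = len(responses)
    return {action: n / total for action, n in counts.items()}

def entropy(dist):
    """Shannon entropy (in bits) of the action distribution: one simple
    way to quantify the uncertainty of the model's choice."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def consistency(per_template_dists):
    """Fraction of question templates whose most likely action agrees
    with the overall majority action. A rough stand-in for the paper's
    formal consistency metric."""
    top = [max(d, key=d.get) for d in per_template_dists]
    majority = Counter(top).most_common(1)[0][0]
    return sum(a == majority for a in top) / len(top)

# Example: 7 of 10 samples pick action 1.
dist = action_likelihood(["action1"] * 7 + ["action2"] * 2 + ["refusal"])
print(dist)           # {'action1': 0.7, 'action2': 0.2, 'refusal': 0.1}
print(entropy(dist))  # ~1.16 bits
```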
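
For the "Experiment Setup" row, here is a hedged sketch of the sampling-and-mapping loop under the stated settings: temperature 1, M = 10 samples per question template for high-ambiguity scenarios (M = 5 for low-ambiguity ones), a rule-based mapping from raw text to actions, and equal template weights. `query_llm` is a hypothetical stub, and the two regex rules only gesture at the paper's full iterative mapping procedure (its Appendix B.2).

```python
import re

def query_llm(prompt, temperature=1.0):
    """Hypothetical stub; replace with a real API or Hugging Face call."""
    raise NotImplementedError

# Illustrative rules only; the paper's iterative rule-based mapping
# (Appendix B.2) is far more thorough.
ACTION_RULES = [
    ("action1", re.compile(r"\baction\s*(a|1)\b", re.IGNORECASE)),
    ("action2", re.compile(r"\baction\s*(b|2)\b", re.IGNORECASE)),
]

def map_response(text):
    """Map a raw response string to an action label, or 'refusal'
    when no rule fires."""
    for action, pattern in ACTION_RULES:
        if pattern.search(text):
            return action
    return "refusal"

def survey_scenario(templates, high_ambiguity=True):
    """Sample M responses per question template at temperature 1 and
    parse each one; M = 10 for high-ambiguity scenarios, 5 otherwise."""
    m = 10 if high_ambiguity else 5
    return [
        [map_response(query_llm(prompt, temperature=1.0)) for _ in range(m)]
        for prompt in templates
    ]

def aggregate(per_template, actions=("action1", "action2", "refusal")):
    """Average per-template action frequencies with equal weights,
    matching the paper's equal weighting of question templates."""
    k = len(per_template)
    return {
        a: sum(labels.count(a) / len(labels) for labels in per_template) / k
        for a in actions
    }
```

Feeding the per-template label lists from `survey_scenario` into the measures sketched above yields the per-scenario choice probabilities, uncertainty, and consistency that the report's rows refer to.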