Evaluating the Moral Beliefs Encoded in LLMs
Authors: Nino Scherrer, Claudia Shi, Amir Feder, David Blei
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | This paper presents a case study on the design, administration, post-processing, and evaluation of surveys on large language models (LLMs). It comprises two components: (1) A statistical method for eliciting beliefs encoded in LLMs. We introduce statistical measures and evaluation metrics that quantify the probability of an LLM "making a choice", the associated uncertainty, and the consistency of that choice. (2) We apply this method to study what moral beliefs are encoded in different LLMs, especially in ambiguous cases where the right choice is not obvious. We design a large-scale survey comprising 680 high-ambiguity moral scenarios (...) We administer the survey to 28 open- and closed-source LLMs. (A minimal code sketch of these measures follows the table.) |
| Researcher Affiliation | Collaboration | Nino Scherrer (1), Claudia Shi (1,2), Amir Feder (2), and David M. Blei (2). Affiliations: (1) FAR AI, (2) Columbia University. |
| Pseudocode | No | The paper describes methods using mathematical definitions and equations (e.g., Definition 1, Eqs. 1–8) but does not present any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code and data are publicly available; the paper's footnote points to https://github.com/ninodimontalcino/moralchoice. |
| Open Datasets | Yes | MoralChoice, a survey dataset containing 1767 moral scenarios and responses from 28 open- and closed-source LLMs. Code and data are publicly available at https://github.com/ninodimontalcino/moralchoice. |
| Dataset Splits | No | The paper describes two sets of scenarios: low-ambiguity (687 scenarios) and high-ambiguity (680 scenarios), which are used as the full evaluation datasets. It does not define train, validation, or test splits of these datasets because the paper evaluates pre-trained LLMs rather than training new models for which such splits would be necessary. |
| Hardware Specification | No | The paper details the LLMs used in the study, including their access types (API or Hugging Face Hub) and parameter sizes (Table 1, Table 14). However, it does not provide specific hardware specifications (e.g., GPU models, CPU types, or memory) for the machines on which the experiments were conducted or the API calls were made. |
| Software Dependencies | No | The paper refers to various LLMs and their frameworks (e.g., OpenAI's GPT-4, Anthropic's Claude, Google's PaLM 2, Hugging Face models like Flan-T5, BLOOMZ, OPT-IML). However, it does not explicitly list specific software dependencies with version numbers (e.g., Python version, library versions like PyTorch or TensorFlow) required for reproducibility. |
| Experiment Setup | Yes | The sampling is performed using a temperature of 1, which controls the randomness of the LLM's responses. We then employ an iterative rule-based mapping procedure to map from sequences to actions. The details of the mapping are provided in Appendix B.2. For high-ambiguity scenarios, we set M to 10, while for low-ambiguity scenarios, we set M to 5. We assign equal weights to each question template. (A hedged sketch of this sampling-and-mapping loop also follows the table.) |
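
As context for the "Research Type" row above: the paper's measures boil down to summarizing a set of parsed model responses per scenario. Below is a minimal Python sketch of that style of measure; the function names and exact estimators (`action_likelihood`, `entropy`, `consistency`) are illustrative assumptions for this report, not the authors' implementation (which lives in the linked repository).

```python
import math
from collections import Counter

def action_likelihood(responses):
    """Empirical distribution over parsed action labels from M samples."""
    counts = Counter(responses)
    total = len(responses)
    return {action: n / total for action, n in counts.items()}

def entropy(dist):
    """Shannon entropy (in bits) of the action distribution: one simple
    way to quantify the uncertainty of the model's choice."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def consistency(per_template_dists):
    """Fraction of question templates whose most likely action agrees
    with the overall majority action. A rough stand-in for the paper's
    formal consistency metric."""
    top = [max(d, key=d.get) for d in per_template_dists]
    majority = Counter(top).most_common(1)[0][0]
    return sum(a == majority for a in top) / len(top)

# Example: 7 of 10 samples pick action 1.
dist = action_likelihood(["action1"] * 7 + ["action2"] * 2 + ["refusal"])
print(dist)           # {'action1': 0.7, 'action2': 0.2, 'refusal': 0.1}
print(entropy(dist))  # ~1.16 bits
```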
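
For the "Experiment Setup" row, here is a hedged sketch of the sampling-and-mapping loop under the stated settings: temperature 1, M = 10 samples per question template for high-ambiguity scenarios (M = 5 for low-ambiguity ones), a rule-based mapping from raw text to actions, and equal template weights. `query_llm` is a hypothetical stub, and the two regex rules only gesture at the paper's full iterative mapping procedure (its Appendix B.2).

```python
import re

def query_llm(prompt, temperature=1.0):
    """Hypothetical stub; replace with a real API or Hugging Face call."""
    raise NotImplementedError

# Illustrative rules only; the paper's iterative rule-based mapping
# (Appendix B.2) is far more thorough.
ACTION_RULES = [
    ("action1", re.compile(r"\baction\s*(a|1)\b", re.IGNORECASE)),
    ("action2", re.compile(r"\baction\s*(b|2)\b", re.IGNORECASE)),
]

def map_response(text):
    """Map a raw response string to an action label, or 'refusal'
    when no rule fires."""
    for action, pattern in ACTION_RULES:
        if pattern.search(text):
            return action
    return "refusal"

def survey_scenario(templates, high_ambiguity=True):
    """Sample M responses per question template at temperature 1 and
    parse each one; M = 10 for high-ambiguity scenarios, 5 otherwise."""
    m = 10 if high_ambiguity else 5
    return [
        [map_response(query_llm(prompt, temperature=1.0)) for _ in range(m)]
        for prompt in templates
    ]

def aggregate(per_template, actions=("action1", "action2", "refusal")):
    """Average per-template action frequencies with equal weights,
    matching the paper's equal weighting of question templates."""
    k = len(per_template)
    return {
        a: sum(labels.count(a) / len(labels) for labels in per_template) / k
        for a in actions
    }
```

Feeding the per-template label lists from `survey_scenario` into the measures sketched above yields the per-scenario choice probabilities, uncertainty, and consistency that the report's rows refer to.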