Whose Opinions Do Language Models Reflect?

Authors: Shibani Santurkar, Esin Durmus, Faisal Ladhak, Cinoo Lee, Percy Liang, Tatsunori Hashimoto

ICML 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We put forth a quantitative framework to investigate the opinions reflected by LMs by leveraging high-quality public opinion polls. Using this framework, we create OpinionQA, a dataset for evaluating the alignment of LM opinions with those of 60 US demographic groups over topics ranging from abortion to automation. Across topics, we find substantial misalignment between the views reflected by current LMs and those of US demographic groups: on par with the Democrat-Republican divide on climate change. Notably, this misalignment persists even after explicitly steering the LMs towards particular groups. Our analysis not only confirms prior observations about the left-leaning tendencies of some human feedback-tuned LMs, but also surfaces groups whose opinions are poorly reflected by current LMs (e.g., 65+ and widowed individuals). (A sketch of the alignment computation this framework implies appears after the table.)
Researcher Affiliation | Academia | Shibani Santurkar¹, Esin Durmus¹, Faisal Ladhak², Cinoo Lee¹, Percy Liang¹, Tatsunori Hashimoto¹ ... ¹Stanford University, ²Columbia University.
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code | Yes | Our code and data are available at https://github.com/tatsu-lab/opinions_qa.
Open Datasets | Yes | Using this framework, we build the OpinionQA dataset using Pew Research’s American Trends Panels, with 1498 questions spanning topics such as science, politics, and personal relationships. ... A.1. Pew research surveys: Our dataset is derived from the annual Pew American Trends Panel (ATP) survey. Below, we provide a brief summary of how the data collection process is conducted, and refer the reader to pewresearch.org/our-methods/u-s-surveys/the-american-trends-panel/ and pewresearch.org/our-methods/u-s-surveys/writing-survey-questions/ for more details. ... Further documentation can be found at beta.openai.com/docs/model-index-for-researchers and docs.ai21.com/docs/.
Dataset Splits | No | The paper evaluates existing language models on the OpinionQA dataset but does not specify any train/validation/test splits, since the authors evaluate pre-trained models rather than training new ones.
Hardware Specification | No | The paper lists the language models used (e.g., OpenAI's ada, davinci, and text-davinci-001/002/003, and AI21 Labs' j1-Grande and j1-Jumbo), along with their approximate sizes, but does not specify the hardware (e.g., GPU models, CPU types, or cloud instances) used to run the evaluations.
Software Dependencies | No | The paper mentions accessing models via APIs and refers to general model documentation, but it does not specify software dependencies (e.g., programming languages, libraries, or frameworks) with version numbers used for running the experiments.
Experiment Setup | Yes | To obtain this, we prompt the model and obtain the next-token log probabilities. Specifically, we measure the log probabilities assigned to each of the answer choices (e.g., 'A', 'B', ... in Figure 1), ignoring all other possible completions (see Appendix A.3 for details). ... A.5. Temperature scaling: In Section 4.1, we compare the model opinion distribution to a sharpened version of its human counterpart. This sharpening makes the human opinion distribution collapse towards its dominant mode. To do so, we use the standard temperature scaling approach from Guo et al. (2017). We use a temperature of 1e-3 in our analysis, but find that our results are fairly robust to the choice of temperature. (Sketches of both steps appear after the table.)
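
The quoted setup has two computational steps: renormalizing next-token log probabilities over the answer letters to obtain a model opinion distribution, and sharpening the human distribution via temperature scaling. Below is a minimal sketch of both, assuming a hypothetical `get_next_token_logprobs(prompt, tokens)` wrapper around the completion APIs (the paper's exact prompt template is given in its Appendix A.3, so the prompt layout here is a stand-in) and the paper's reported temperature of 1e-3.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over a 1-D array."""
    z = np.asarray(logits, dtype=float)
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def build_prompt(question, choices):
    """Illustrative multiple-choice prompt; the paper's exact template
    is in its Appendix A.3, so treat this layout as an assumption."""
    lines = [f"Question: {question}"]
    lines += [f"{letter}. {text}" for letter, text in choices]
    lines.append("Answer:")
    return "\n".join(lines)

def model_opinion_distribution(question, choices, get_next_token_logprobs):
    """Log probabilities the LM assigns to each answer letter as the next
    token, renormalized over the choices only (all other completions are
    ignored, per the quoted setup). `get_next_token_logprobs` is a
    hypothetical API wrapper returning a {token: logprob} dict."""
    prompt = build_prompt(question, choices)
    letters = [letter for letter, _ in choices]
    logprobs = get_next_token_logprobs(prompt, letters)
    return softmax([logprobs[letter] for letter in letters])

def temperature_scale(p, temperature=1e-3):
    """Sharpen a distribution toward its dominant mode via standard
    temperature scaling (Guo et al., 2017); the paper reports T = 1e-3."""
    logp = np.log(np.asarray(p, dtype=float) + 1e-12)
    return softmax(logp / temperature)

# Example with a stubbed API returning fixed log probabilities:
stub = lambda prompt, letters: {"A": -1.2, "B": -0.4, "C": -2.3, "D": -3.0}
choices = [("A", "Strongly agree"), ("B", "Agree"),
           ("C", "Disagree"), ("D", "Strongly disagree")]
print(model_opinion_distribution("Is automation good for society?", choices, stub))

# Hypothetical human opinion distribution for one survey question:
p_human = [0.45, 0.30, 0.15, 0.10]
print(temperature_scale(p_human))  # ~[1, 0, 0, 0]: collapses to the mode
```

With T = 1e-3 the sharpened human distribution is essentially one-hot at the modal answer, which is what "collapse towards its dominant mode" describes.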
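The abstract's comparison ("on par with the Democrat-Republican divide on climate change") presupposes a distance between opinion distributions. The full paper builds its framework on the 1-Wasserstein distance over ordinal answer choices; the sketch below is one plausible rendering of that idea, mapping the N choices to integer positions and normalizing by the maximum possible distance N-1. The exact mapping and normalization are assumptions here; consult the paper for the precise definition.

```python
import numpy as np
from scipy.stats import wasserstein_distance

def opinion_alignment(p_model, p_group):
    """1 minus the normalized 1-Wasserstein distance between two
    distributions over the same N ordinal answer choices. Choices are
    mapped to positions 0..N-1, so dividing the distance by N-1 bounds
    the score in [0, 1], with 1 meaning identical distributions."""
    n = len(p_model)
    positions = np.arange(n)
    wd = wasserstein_distance(positions, positions,
                              u_weights=p_model, v_weights=p_group)
    return 1.0 - wd / (n - 1)

# Hypothetical distributions over four ordinal choices:
p_model = [0.10, 0.20, 0.40, 0.30]
p_group = [0.05, 0.15, 0.45, 0.35]
print(opinion_alignment(p_model, p_group))  # ~0.93: close alignment
```

Averaged over all questions, a score like this lets one compare a model-to-group gap against the gap between two human groups, which is how the Democrat-Republican yardstick in the abstract is meant.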