Questioning the Survey Responses of Large Language Models

Authors: Ricardo Dominguez-Olmedo, Moritz Hardt, Celestine Mendler-Dünner

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Evaluating 43 different language models using de-facto standard prompting methodologies, we establish two dominant patterns."
Researcher Affiliation | Academia | Ricardo Dominguez-Olmedo (1,2), Moritz Hardt (1,2), Celestine Mendler-Dünner (1,2,3); (1) Max Planck Institute for Intelligent Systems, Tübingen; (2) Tübingen AI Center; (3) ELLIS Institute Tübingen; {rdo,hardt,cmendler}@tuebingen.mpg.de
Pseudocode | No | The paper describes its methods in prose and figures but contains no formally labeled "Pseudocode" or "Algorithm" block.
Open Source Code | Yes | "We open source the code to replicate all experiments." https://github.com/socialfoundations/surveying-language-models
Open Datasets | Yes | "We use the American Community Survey (ACS) Public Use Microdata Sample (PUMS) files made available by the U.S. Census Bureau." https://www.census.gov/programs-surveys/acs/microdata
Dataset Splits | No | The ACS PUMS serves as reference data against which LLM responses are compared; since no model is trained on it, the paper defines no train, validation, or test splits.
Hardware Specification | Yes | "We run the models in an internal cluster. The total number of GPU hours needed to complete all experiments is approximately 1500 (NVIDIA A100)."
Software Dependencies | No | The paper mentions the Folktables Python package but specifies neither its version nor the versions of other key software components.
Experiment Setup | Yes | "We construct an input prompt of the form Question: <question> \n A. <choice 1> \n B. <choice 2> \n ... <choice kq> \n Answer:. [...] We survey 43 language models of size varying from 110M to 175B parameters: the base models GPT-2 [Radford et al., 2019], GPT-Neo [Black et al., 2021], Pythia [Biderman et al., 2023], MPT [Mosaic ML, 2023], Llama 2 [Touvron et al., 2023], Llama 3 [Dubey et al., 2024] and GPT-3 [Brown et al., 2020]; as well as the instruct variants of MPT 7B and GPT-NeoX 20B, the Dolly fine-tune of Pythia 12B [Databricks, 2023], Llama 2 Chat, Llama 3 Instruct, the text-davinci variants of GPT-3 [Ouyang et al., 2022], and GPT-4 [OpenAI, 2023]."
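The multiple-choice prompt format quoted in the Experiment Setup row can be sketched in a few lines of Python. This is an illustrative reconstruction, not the authors' released code: the function name and the exact whitespace around "Answer:" are assumptions based on the quoted template.

```python
import string


def build_prompt(question: str, choices: list[str]) -> str:
    """Format a survey question in the quoted style:
    'Question: <question>' followed by lettered choices and 'Answer:'."""
    lines = [f"Question: {question}"]
    # Label choices A., B., C., ... in order.
    for letter, choice in zip(string.ascii_uppercase, choices):
        lines.append(f"{letter}. {choice}")
    lines.append("Answer:")
    return "\n".join(lines)


# Hypothetical ACS-style item, for illustration only.
print(build_prompt("What is this person's sex?", ["Male", "Female"]))
# Question: What is this person's sex?
# A. Male
# B. Female
# Answer:
```

Under this scheme, the model's survey answer is read off from the continuation it assigns after "Answer:" (e.g., the choice letter with the highest next-token probability).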