Questioning the Survey Responses of Large Language Models
Authors: Ricardo Dominguez-Olmedo, Moritz Hardt, Celestine Mendler-Dünner
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Evaluating 43 different language models using de-facto standard prompting methodologies, we establish two dominant patterns. |
| Researcher Affiliation | Academia | Ricardo Dominguez-Olmedo (1,2), Moritz Hardt (1,2), Celestine Mendler-Dünner (1,2,3); (1) Max-Planck Institute for Intelligent Systems, Tübingen; (2) Tübingen AI Center; (3) ELLIS Institute Tübingen; {rdo,hardt,cmendler}@tuebingen.mpg.de |
| Pseudocode | No | The paper describes methods in text and uses figures but does not contain a formally labeled 'Pseudocode' or 'Algorithm' block. |
| Open Source Code | Yes | We open source the code to replicate all experiments. [...] https://github.com/socialfoundations/surveying-language-models |
| Open Datasets | Yes | We use the American Community Survey (ACS) Public Use Microdata Sample (PUMS) files made available by the U.S. Census Bureau. [...] https://www.census.gov/programs-surveys/acs/microdata (A data-loading sketch follows the table.) |
| Dataset Splits | No | The paper uses the American Community Survey (ACS) Public Use Microdata Sample as reference data against which LLM responses are compared. It does not define train, validation, or test splits for this dataset, since it serves as a benchmark reference rather than training data for a model. |
| Hardware Specification | Yes | We run the models in an internal cluster. The total number of GPU hours needed to complete all experiments is approximately 1500 (NVIDIA A100). |
| Software Dependencies | No | The paper mentions using the 'Folktables Python package' but does not specify its version or other key software components with their specific versions. |
| Experiment Setup | Yes | We construct an input prompt of the form Question: <question> \n A. <choice 1>\n B. <choice 2> \n ... <choice k_q> \n Answer: . [...] We survey 43 language models of size varying from 110M to 175B parameters: the base models GPT-2 [Radford et al., 2019], GPT-Neo [Black et al., 2021], Pythia [Biderman et al., 2023], MPT [Mosaic ML, 2023], Llama 2 [Touvron et al., 2023], Llama 3 [Dubey et al., 2024] and GPT-3 [Brown et al., 2020]; as well as the instruct variants of MPT 7B and GPT-NeoX 20B, the Dolly fine-tune of Pythia 12B [Databricks, 2023], Llama 2 Chat, Llama 3 Instruct, the text-davinci variants of GPT-3 [Ouyang et al., 2022], and GPT-4 [OpenAI, 2023]. (A prompt-construction sketch follows the table.) |
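
The prompt template quoted in the Experiment Setup row can be made concrete with a short sketch. This is a minimal reconstruction of the quoted format, not the authors' code: the example question and answer choices are illustrative placeholders, and the exact whitespace around the newlines is assumed from the quoted template.

```python
import string

def build_prompt(question: str, choices: list[str]) -> str:
    """Assemble a multiple-choice prompt in the quoted
    'Question: ... / A. ... / Answer:' format."""
    lines = [f"Question: {question}"]
    for letter, choice in zip(string.ascii_uppercase, choices):
        lines.append(f"{letter}. {choice}")
    lines.append("Answer:")
    return "\n".join(lines)

# Illustrative ACS-style question; wording and choices are placeholders,
# not taken verbatim from the paper.
print(build_prompt("What is this person's sex?", ["Male", "Female"]))
```

How the answer is then scored (next-token probabilities over the letter choices versus free generation) is not specified in the excerpts above; the linked repository contains the authors' exact procedure.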
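
The Software Dependencies row notes that the paper uses the Folktables Python package to obtain the ACS PUMS files linked in the Open Datasets row. Below is a minimal loading sketch; the survey year, horizon, and state are illustrative assumptions, not settings reported in the excerpts above.

```python
from folktables import ACSDataSource

# Download one year of ACS PUMS person records via the Folktables package.
# survey_year, horizon, and states are illustrative choices, not values
# taken from the paper.
data_source = ACSDataSource(survey_year="2018", horizon="1-Year", survey="person")
acs_data = data_source.get_data(states=["CA"], download=True)

print(acs_data.shape)         # (rows, columns) of the raw PUMS person table
print(acs_data.columns[:10])  # PUMS variable codes, e.g. AGEP, SEX, PINCP
```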