Questioning the Survey Responses of Large Language Models
Authors: Ricardo Dominguez-Olmedo, Moritz Hardt, Celestine Mendler-Dünner
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Evaluating 43 different language models using de-facto standard prompting methodologies, we establish two dominant patterns. |
| Researcher Affiliation | Academia | Ricardo Dominguez-Olmedo (1,2), Moritz Hardt (1,2), Celestine Mendler-Dünner (1,2,3); (1) Max-Planck Institute for Intelligent Systems, Tübingen; (2) Tübingen AI Center; (3) ELLIS Institute Tübingen; {rdo,hardt,cmendler}@tuebingen.mpg.de |
| Pseudocode | No | The paper describes methods in text and uses figures but does not contain a formally labeled 'Pseudocode' or 'Algorithm' block. |
| Open Source Code | Yes | We open source the code to replicate all experiments. [...] https://github.com/socialfoundations/surveying-language-models |
| Open Datasets | Yes | We use the American Community Survey (ACS) Public Use Microdata Sample (PUMS) files made available by the U.S. Census Bureau. [...] https://www.census.gov/programs-surveys/acs/microdata (A data-loading sketch follows the table.) |
| Dataset Splits | No | The paper uses the American Community Survey (ACS) Public Use Microdata Sample as reference data against which LLM responses are compared. It does not define train, validation, or test splits for this dataset, since it serves as a benchmark reference rather than training data for a model. |
| Hardware Specification | Yes | We run the models in an internal cluster. The total number of GPU hours needed to complete all experiments is approximately 1500 (NVIDIA A100). |
| Software Dependencies | No | The paper mentions using the 'Folktables Python package' but does not specify its version or other key software components with their specific versions. |
| Experiment Setup | Yes | We construct an input prompt of the form Question: <question> \n A. <choice 1>\n B. <choice 2> \n ... <choice k_q> \n Answer: . [...] We survey 43 language models of size varying from 110M to 175B parameters: the base models GPT-2 [Radford et al., 2019], GPT-Neo [Black et al., 2021], Pythia [Biderman et al., 2023], MPT [Mosaic ML, 2023], Llama 2 [Touvron et al., 2023], Llama 3 [Dubey et al., 2024] and GPT-3 [Brown et al., 2020]; as well as the instruct variants of MPT 7B and GPT-NeoX 20B, the Dolly fine-tune of Pythia 12B [Databricks, 2023], Llama 2 Chat, Llama 3 Instruct, the text-davinci variants of GPT-3 [Ouyang et al., 2022], and GPT-4 [OpenAI, 2023]. (A prompt-construction sketch follows the table.) |
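
The prompt template quoted in the Experiment Setup row can be made concrete with a short sketch. This is a minimal reconstruction of the quoted format, not the authors' code: the example question and answer choices are illustrative placeholders, and the exact whitespace around the newlines is assumed from the quoted template.

```python
import string

def build_prompt(question: str, choices: list[str]) -> str:
    """Assemble a multiple-choice prompt in the quoted
    'Question: ... / A. ... / Answer:' format."""
    lines = [f"Question: {question}"]
    for letter, choice in zip(string.ascii_uppercase, choices):
        lines.append(f"{letter}. {choice}")
    lines.append("Answer:")
    return "\n".join(lines)

# Illustrative ACS-style question; wording and choices are placeholders,
# not taken verbatim from the paper.
print(build_prompt("What is this person's sex?", ["Male", "Female"]))
```

How the answer is then scored (next-token probabilities over the letter choices versus free generation) is not specified in the excerpts above; the linked repository contains the authors' exact procedure.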
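
The Software Dependencies row notes that the paper uses the Folktables Python package to obtain the ACS PUMS files linked in the Open Datasets row. Below is a minimal loading sketch; the survey year, horizon, and state are illustrative assumptions, not settings reported in the excerpts above.

```python
from folktables import ACSDataSource

# Download one year of ACS PUMS person records via the Folktables package.
# survey_year, horizon, and states are illustrative choices, not values
# taken from the paper.
data_source = ACSDataSource(survey_year="2018", horizon="1-Year", survey="person")
acs_data = data_source.get_data(states=["CA"], download=True)

print(acs_data.shape)         # (rows, columns) of the raw PUMS person table
print(acs_data.columns[:10])  # PUMS variable codes, e.g. AGEP, SEX, PINCP
```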