Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et alK. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026..
Bias Runs Deep: Implicit Reasoning Biases in Persona-Assigned LLMs
Authors: Shashank Gupta, Vaishnavi Shrivastava, Ameet Deshpande, Ashwin Kalyan, Peter Clark, Ashish Sabharwal, Tushar Khot
ICLR 2024 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our study covers 24 reasoning datasets (spanning mathematics, law, medicine, morals, and more), 4 LLMs (2 versions of Chat GPT-3.5, GPT4-Turbo, and Llama-2-70b-chat), and 19 diverse personas (e.g., an Asian person ) spanning 5 socio-demographic groups: race, gender, religion, disability, and political affiliation. Our experiments unveil that LLMs harbor deep rooted bias against various socio-demographics underneath a veneer of fairness. |
| Researcher Affiliation | Collaboration | 1Allen Institute for AI 2Stanford University 3Princeton University |
| Pseudocode | No | The paper does not include pseudocode or a clearly labeled algorithm block. |
| Open Source Code | Yes | Code and model outputs: https://allenai.github.io/persona-bias. |
| Open Datasets | Yes | We select 24 datasets from MMLU (Hendrycks et al., 2021), Big-Bench-Hard (Suzgun et al., 2022), and MBPP (Austin et al., 2021) to evaluate the knowledge and reasoning abilities of LLMs in diverse domains. |
| Dataset Splits | Yes | For all datasets, we make use of the official test partitions in our evaluations. |
| Hardware Specification | No | To fit such a model within our GPUs, we use the AWQ quantized (Lin et al., 2023) model from Hugging Face (The Bloke/Llama-2-70b-Chat-AWQ). |
| Software Dependencies | Yes | We primarily focus on Chat GPT-3.5 (gpt-3.5-turbo-0613) as it has demonstrated impressive persona-following (Park et al., 2023) and reasoning (Qin et al., 2023) abilities. We also experimented with the latest release (Nov. 2023) of Chat GPT-3.5 (gpt-3.5-turbo-1106), GPT-4Turbo (gpt-4-turbo-1106), and Llama-2-70b-chat, and include their results in Appendix D. |
| Experiment Setup | Yes | We use a max token length of 1024, temperature 0, and a top-p value of 1 (equivalent to greedy decoding). |