Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Exploring Prosocial Irrationality for LLM Agents: A Social Cognition View

Authors: Xuan Liu, Jie Zhang, Haoyang Shang, Song Guo, Chengxu Yang, Quanyan Zhu

ICLR 2025 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type: Experimental. "Experimental results on CogMir subsets show that LLM Agents and humans exhibit high consistency in irrational and prosocial decision-making under uncertain conditions, underscoring the prosociality of LLM Agents as social entities and highlighting the significance of hallucination properties. Our experimental findings highlight the alignment and distinctions between LLM Agents and humans in the decision-making process."
Researcher Affiliation: Academia. "Xuan Liu [1], Jie Zhang [2], Haoyang Shang [3], Song Guo [2], Chengxu Yang [4], Quanyan Zhu [5]. [1] Hong Kong Polytechnic University; [2] Hong Kong University of Science and Technology; [3] Shanghai Jiao Tong University; [4] Wuhan University of Technology; [5] New York University." Corresponding author: Xuan Liu (EMAIL).
Pseudocode: No. The paper describes its methods and processes in narrative text and diagrams (e.g., Figure 2: CogMir Framework; Figure 9: Architecture of Individual LLM Agents) but does not present structured pseudocode or algorithm blocks.
Open Source Code: Yes. "Figure 1: Sample evaluation subsets in the CogMir framework. CogMir mirrors human cognitive bias and LLM Agents' systematic hallucination through social science experiments via representational social and cognitive phenomena." Repository: https://github.com/XuanL17/CogMir
Open Datasets: Yes. "Utilizing social science literature (Lilienfeld et al., 2017) and existing AI social intelligence test datasets (Sap et al., 2019; Le et al., 2019; Zhou et al., 2024; Liu et al., 2024), we developed three evaluation datasets..."
Dataset Splits: No. The paper describes custom-constructed evaluation datasets such as 'Known MCQ' and 'Unknown MCQ' and states 'For each MCQ dataset, we query every MCQ 10 times, resulting in 10 × 100 inquiries.' and 'We randomly selected 10 stories from the dataset, each subjected to ten inquiries.' This specifies the evaluation methodology and number of inquiries but provides no training/validation/test splits (e.g., percentages or sample counts) needed for reproduction.
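The repeated-query evaluation protocol quoted above (each MCQ asked 10 times) can be sketched as follows. This is a minimal illustration, not the authors' code: `ask_model` is a hypothetical callable standing in for an API call to one of the evaluated LLMs.

```python
from collections import Counter

def evaluate_mcq_dataset(mcqs, ask_model, n_repeats=10):
    """Query every MCQ n_repeats times and tally the answer distribution.

    mcqs: dict mapping an MCQ id to its question text.
    ask_model: hypothetical callable (question -> answer string) that would
        wrap an LLM API request in a real run.
    Returns a dict of per-question answer Counters.
    """
    results = {}
    for mcq_id, question in mcqs.items():
        answers = [ask_model(question) for _ in range(n_repeats)]
        results[mcq_id] = Counter(answers)
    return results

# Usage with a deterministic stand-in model that always picks the last token:
mcqs = {"q1": "Choose A or B", "q2": "Choose C or D"}
dist = evaluate_mcq_dataset(mcqs, ask_model=lambda q: q.split()[-1])
```

With 100 MCQs per dataset this loop yields the 10 × 100 inquiries the paper reports; the answer distribution per question is what makes consistency across repeats measurable.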
Hardware Specification: No. The paper does not describe the hardware (e.g., GPU/CPU models, memory) used to run the experiments; it only lists the 'Selected LLM Models' without detailing the infrastructure on which they were executed.
Software Dependencies: Yes. "We select seven state-of-the-art models to serve as participants and evaluation subjects within our framework, specifically: gpt-4-0125-preview (OpenAI, 2023), gpt-3.5-turbo (OpenAI, 2023), open-mixtral-8x7b (Mistral AI, 2024), mistral-medium-2312 (Mistral AI, 2024), claude-2.0 (Anthropic, 2024), claude-3.0-opus (Anthropic, 2024), and gemini-1.0-pro (Google, 2023). For this reason, we employed SimCSE-RoBERTa-large (Gao et al., 2021) as a technical discriminator."
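A SimCSE-style discriminator decides whether a free-text answer matches a reference by comparing sentence embeddings. The sketch below shows only the comparison step, assuming embedding vectors are produced elsewhere (e.g., by loading a SimCSE checkpoint via the transformers library); the 0.8 threshold is an illustrative assumption, not a value reported in the paper.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two embedding vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def is_semantic_match(emb_answer, emb_reference, threshold=0.8):
    # threshold is an assumed cutoff for illustration; the paper does not
    # report the value used with SimCSE-RoBERTa-large.
    return cosine_similarity(emb_answer, emb_reference) >= threshold
```

In a real pipeline both texts would be encoded with the same SimCSE model so that their embeddings live in a shared space before the cosine comparison.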
Experiment Setup: Yes. "All LLM Agents have a fixed temperature parameter of 1, with no model fine-tuning. Our prompt templates consist of two parts: Fixed Context and Scene-Adjustable Features. Prompts are customized by populating Scene-Adjustable Features with specific instances from the abovementioned datasets."