Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Many LLMs Are More Utilitarian Than One

Authors: Anita Keshmirian, Razan Baltaji, Babak Hemmatian, Hadi Asghari, Lav R. Varshney

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We test six models on well-established sets of moral dilemmas across two conditions: (1) Solo, where models reason independently, and (2) Group, where they engage in multi-turn discussions in pairs or triads. ... To address these gaps, the current study uses controlled experiments and validated psychological measurement tools to investigate how collective moral reasoning occurs in LLM-MAS.
Researcher Affiliation	Academia	1 Forward College; 2 University of Illinois at Urbana-Champaign; 3 University of Nebraska, Lincoln; 4 Technische Universität Berlin; 5 Humboldt Institute for Internet and Society; 6 Stony Brook University
Pseudocode	No	The paper describes its methods and experimental setup in prose, including sections like "Experimental Design" and "Data Preparation", but it does not contain any clearly labeled pseudocode blocks, algorithm figures, or code-like structured steps.
Open Source Code	Yes	Code available at: https://github.com/baltaci-r/Moral Agents
Open Datasets	Yes	We use publicly available datasets from human (psychology) studies that are widely recognized and available as presented in Sec 3.2. ... For example, the Körner's Moral Framework dataset is used under its Creative Commons license. The Oxford Utilitarianism Scale is used under its Creative Commons license. The dataset from [36] is used under the Creative Commons license for Science. All datasets and models are appropriately cited in the paper.
Dataset Splits	No	The paper mentions collecting data in different conditions (Solo, Pair, Triad) and sampling for human evaluation, for example: "A stratified sample comprising approximately 1% of the model-generated arguments, balanced by dilemma type, model, and condition, was double-rated..." However, it does not provide explicit training, validation, or test dataset splits in the context of machine learning model development or evaluation, as the LLMs are used as agents rather than being trained in this paper.
Hardware Specification	No	The paper states that "For open-source LLMs, we use the default temperature setting provided by Ollama for each model" and "most open-source models ran on the Together AI platform", and refers to the "OpenAI playground" for closed-source models. However, it does not provide specific hardware details such as exact GPU/CPU models, processor types, or memory amounts used for running these models.
Software Dependencies	Yes	We compared them across conditions and discussion stages using mixed-effects regression models with the <ordinal> package in R[28]... R package version 2023.12-4.1.
Experiment Setup	Yes	For closed-source LLMs, we use a temperature of 0.7, which is the default setting in the OpenAI playground. ... Each trial was repeated three times to ensure reliability (henceforth called repetitions).
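
The stratified human-evaluation sample quoted in the Dataset Splits row (approximately 1% of model-generated arguments, balanced by dilemma type, model, and condition) could be drawn along the following lines. This is a minimal Python sketch, not the authors' code: the record field names (`dilemma_type`, `model`, `condition`), the example values, the fixed seed, and the rule of taking at least one item per stratum are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def stratified_sample(records, keys, fraction=0.01, seed=0):
    """Draw roughly `fraction` of the records from every stratum.

    A stratum is one combination of the values found under `keys`.
    Assumption: each stratum contributes at least one record, so that
    small strata are not dropped from the sample.
    """
    rng = random.Random(seed)  # fixed seed keeps the draw repeatable
    strata = defaultdict(list)
    for rec in records:
        strata[tuple(rec[k] for k in keys)].append(rec)

    sample = []
    for _, members in sorted(strata.items()):
        n = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, n))
    return sample


# Hypothetical records: field names and values are placeholders.
records = [
    {"dilemma_type": d, "model": m, "condition": c, "argument_id": i}
    for d in ("personal", "impersonal")
    for m in ("model_a", "model_b")
    for c in ("solo", "pair", "triad")
    for i in range(200)
]

sample = stratified_sample(records, keys=("dilemma_type", "model", "condition"))
print(len(sample), "of", len(records), "records selected")
```

Sampling within each stratum, rather than from the pooled records, is what keeps the rated subset balanced across the three factors even when some combinations are rare.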