Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Wisdom from Diversity: Bias Mitigation Through Hybrid Human-LLM Crowds

Authors: Axel Abels, Tom Lenaerts

IJCAI 2025 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental By analyzing LLM responses to bias-eliciting headlines, we find that these models often mirror human biases. To address this, we explore crowd-based strategies for mitigating bias through response aggregation. We first demonstrate that simply averaging responses from multiple LLMs, intended to leverage the "wisdom of the crowd", can exacerbate existing biases due to the limited diversity within LLM crowds. In contrast, we show that locally weighted aggregation methods more effectively leverage the wisdom of the LLM crowd, achieving both bias mitigation and improved accuracy. Finally, recognizing the complementary strengths of LLMs (accuracy) and humans (diversity), we demonstrate that hybrid crowds containing both significantly enhance performance and further reduce biases across ethnic and gender-related contexts.
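The contrast drawn in the abstract, between plain averaging and locally weighted aggregation, can be illustrated with a minimal sketch. This is not the paper's exact aggregation scheme (which uses methods such as Expertise Trees); the weighting rule below is a hypothetical stand-in chosen only to show why reliability-based weights can dilute the influence of a biased or inaccurate crowd member.

```python
import numpy as np

def crowd_mean(responses):
    """Unweighted wisdom-of-the-crowd aggregate: a simple average.
    If members share a common bias, averaging preserves it rather
    than cancelling it out."""
    return float(np.mean(responses))

def weighted_aggregate(responses, weights):
    """Weighted aggregate: members judged more reliable (e.g. on
    held-out items) contribute more to the final estimate."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()  # normalise so the weights sum to 1
    return float(w @ np.asarray(responses, dtype=float))

# Three hypothetical raters scoring one headline on a 1-5 scale;
# the third rater disagrees with the first two.
responses = [4.0, 4.0, 1.0]
print(crowd_mean(responses))                     # 3.0
print(weighted_aggregate(responses, [1, 1, 4]))  # 2.0
```

With uniform weights the dissenting rater moves the estimate to 3.0; giving that rater four times the weight pulls the aggregate to 2.0, showing how the weighting, not the raw average, determines whose judgment dominates.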
Researcher Affiliation Academia Axel Abels (1,2,3), Tom Lenaerts (1,2,3,4). 1: Machine Learning Group, Université Libre de Bruxelles; 2: AI Lab, Vrije Universiteit Brussel; 3: FARI, AI for the Common-Good Institute, ULB-VUB; 4: Center for Human-Compatible AI, UC Berkeley.
Pseudocode No The paper describes methods and calculations (e.g., Equation 1 for counterfactual bias) and refers to existing methods like Expertise Trees [Abels et al., 2023], but it does not include any explicitly labeled pseudocode or algorithm blocks with structured steps in the main text.
Open Source Code Yes Code available at [Abels and Lenaerts, 2025], whose reference entry provides the URL: https://github.com/axelabels/Hybrid Crowds (accessed 2025-05-22).
Open Datasets Yes In [Abels et al., 2024], participants evaluated a balanced set of genuine and altered news headlines, where demographic groups were swapped to create counterfactual pairs. For example, the headline "Men more likely than women to say they are financially better off since last year" was altered to "Women more likely than men to say they are financially better off since last year". Headlines described positive or negative outcomes for various demographic groups (gender, ethnicity, age), and participants rated their authenticity on a scale from "very unlikely" to "very likely". This design allowed the authors to measure bias by comparing error rates across demographic groups and outcomes. For example, discrepancies in error rates for positive vs. negative outcomes for white individuals revealed underlying biases.
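The bias measurement described above, comparing error rates across counterfactual versions of the same headlines, can be sketched as follows. The data and the sign convention are invented for illustration; the paper's formal definition is its Equation 1, which is not reproduced here.

```python
from statistics import mean

def error_rate(is_genuine, judged_genuine):
    """Fraction of headlines whose authenticity was misjudged."""
    return mean(int(t != p) for t, p in zip(is_genuine, judged_genuine))

def counterfactual_bias(errors_version_a, errors_version_b):
    """Gap in error rates between counterfactual versions of the same
    headlines; a nonzero gap means that swapping the demographic group
    changed how believable raters found the stories."""
    return errors_version_a - errors_version_b

# Toy example: raters judge the same four stories twice, once framed
# around group A and once with the group swapped to B. Ground-truth
# authenticity is identical across framings.
truth            = [True, True, False, False]
judged_a_version = [True, True, False, True]   # 1 error on A-framed items
judged_b_version = [True, False, True, True]   # 3 errors on B-framed items

gap = counterfactual_bias(error_rate(truth, judged_a_version),
                          error_rate(truth, judged_b_version))
print(gap)  # -0.5: B-framed versions are misjudged more often
```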
Dataset Splits Yes For both weighted average methods, we use cross-validation to ensure that weights are not trained on the headlines they are evaluated on.
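The cross-validation constraint quoted above, that a headline never contributes to the weights used to score it, can be sketched as a fold loop. The inverse-error weighting rule is a hypothetical placeholder, not the paper's actual weight-fitting procedure.

```python
import numpy as np

def cv_weighted_predictions(responses, truth, n_folds=5):
    """For each fold, fit weights on the remaining folds, then
    aggregate responses on the held-out headlines only."""
    R = np.asarray(responses, dtype=float)  # shape (n_members, n_items)
    y = np.asarray(truth, dtype=float)      # shape (n_items,)
    n_items = R.shape[1]
    folds = np.array_split(np.arange(n_items), n_folds)
    preds = np.empty(n_items)
    for held_out in folds:
        train = np.setdiff1d(np.arange(n_items), held_out)
        # Placeholder weighting: inverse mean absolute error on the
        # training folds, normalised to sum to 1.
        err = np.abs(R[:, train] - y[train]).mean(axis=1)
        w = 1.0 / (err + 1e-9)
        w /= w.sum()
        preds[held_out] = w @ R[:, held_out]
    return preds

# Toy crowd: member 0 is always right, member 1 always wrong.
preds = cv_weighted_predictions([[0, 1, 0, 1], [1, 0, 1, 0]],
                                [0, 1, 0, 1], n_folds=2)
print(np.round(preds, 6))  # essentially [0. 1. 0. 1.]
```

Because the weights for each held-out fold are fit only on the other folds, the aggregate on a headline never reflects that headline's own labels.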
Hardware Specification No The resources and services used in this work were provided by the VSC (Flemish Supercomputer Center), funded by the Research Foundation Flanders (FWO) and the Flemish Government. This mentions a supercomputer center but lacks specific hardware details like GPU or CPU models.
Software Dependencies No The paper discusses the use of various LLMs from different providers (OpenAI, Anthropic, Google, Meta, Mistral, Alibaba, and DeepSeek AI) but does not specify the version numbers of any ancillary software, programming languages, or libraries used for the implementation or experiments.
Experiment Setup Yes All LLMs were prompted in a 4-shot setting with instructions designed to replicate the guidance provided to human participants. Prompts included a brief explanation of the task and example responses. Details on the prompting procedure and parameters are given in the supplementary materials.
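The 4-shot prompting setup described above can be sketched as a prompt-assembly helper. The field names and wording here are illustrative assumptions; the paper's actual instructions and parameters are given only in its supplementary materials.

```python
def build_few_shot_prompt(instructions, examples, headline):
    """Assemble a k-shot prompt: task instructions, k worked
    examples, then the new headline left for the model to rate."""
    parts = [instructions]
    for example_headline, example_rating in examples:
        parts.append(f"Headline: {example_headline}\nRating: {example_rating}")
    parts.append(f"Headline: {headline}\nRating:")
    return "\n\n".join(parts)

# Four worked examples mimic the 4-shot setting.
shots = [(f"Example headline {i}", "very likely") for i in range(4)]
prompt = build_few_shot_prompt(
    "Rate how likely each headline is to be genuine, "
    "from 'very unlikely' to 'very likely'.",
    shots,
    "Men more likely than women to say they are financially better off",
)
print(prompt.count("Headline:"))  # 5: four examples plus the query
```

The prompt ends with an unfilled "Rating:" field so that the model's completion supplies the judgment for the new headline.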