Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

MoodAngels: A Retrieval-augmented Multi-agent Framework for Psychiatry Diagnosis

Authors: Mengxi Xiao, Ben Liu, He Li, Jimin Huang, Qianqian Xie, Xiaofen Zong, Mang Ye, Min Peng

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Experimental results demonstrate that MoodAngels outperforms conventional methods, with our baseline agent achieving 12.3% higher accuracy than GPT-4o on real-world cases, and our full multi-agent system delivering further improvements. Evaluation in the MoodSyn dataset demonstrates exceptional fidelity, accurately reproducing both the core statistical patterns and complex relationships present in the original data while maintaining strong utility for machine learning applications.
Researcher Affiliation	Collaboration	(1) School of Artificial Intelligence, Wuhan University; (2) Center for Language and Information Research, Wuhan University; (3) School of Computer Science, Wuhan University; (4) Department of Psychiatry, Renmin Hospital of Wuhan University; (5) The Fin AI; (6) Taikang Center for Life and Medical Sciences, Wuhan University
Pseudocode	No	The paper describes the framework and methods in narrative text and figures, such as Figure 1 and detailed descriptions in Section 2, but does not present any explicit pseudocode or algorithm blocks.
Open Source Code	Yes	Code and synthetic data sample are available in MoodAngels.
Open Datasets	Yes	Complementing this framework, we introduce MoodSyn, an open-source dataset of 1,173 synthetic psychiatric cases that preserves clinical validity while ensuring patient privacy. Experimental results demonstrate that MoodAngels outperforms conventional methods, with our baseline agent achieving 12.3% higher accuracy than GPT-4o on real-world cases, and our full multi-agent system delivering further improvements. Evaluation in the MoodSyn dataset demonstrates exceptional fidelity, accurately reproducing both the core statistical patterns and complex relationships present in the original data while maintaining strong utility for machine learning applications.
Dataset Splits	Yes	We partitioned the dataset such that 80% of the cases are used as historical cases for retrieval, while the remaining 20% serve as the test set.
Hardware Specification	Yes	These testing procedures take place on a computational infrastructure consisting of two NVIDIA A800 Tensor Core GPUs, equipped with 80GB of memory.
Software Dependencies	Yes	LLaMA3-8B-Instruct [21], Mistral-7B-Instruct-v0.3 [22], GPT-4o (gpt-4o-2024-08-06) [23], DeepSeek-V3 [24], and medgemma-27b-text-it [25].
Experiment Setup	Yes	The retrieval number k for Angel.D and Angel.C is set to k = 5 by default. For each model, we employ default parameter settings, utilizing official models for open-source LLMs obtained from Hugging Face or the API from the official website.
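
The Dataset Splits and Experiment Setup rows describe an 80/20 partition into a retrieval pool and a test set, with k = 5 retrieved cases per query. The following is a minimal illustrative sketch of that protocol, not the authors' implementation: the function names, the fixed seed, the shuffling step, and the toy similarity function are all assumptions introduced here, and the paper's actual retrieval method is not specified on this page.

```python
import random


def split_cases(cases, retrieval_fraction=0.8, seed=0):
    """Shuffle cases and partition them into (historical pool, test set).

    The seed and the shuffle are assumptions; the source only states the
    80%/20% proportions.
    """
    shuffled = list(cases)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * retrieval_fraction)
    return shuffled[:cut], shuffled[cut:]


def retrieve_top_k(query, pool, similarity, k=5):
    """Return the k pool items most similar to the query (k = 5 by default)."""
    ranked = sorted(pool, key=lambda case: similarity(query, case), reverse=True)
    return ranked[:k]


def toy_similarity(a, b):
    """Hypothetical stand-in for a real case-similarity measure."""
    return -abs(a - b)


# Toy data: integers stand in for psychiatric case records.
cases = list(range(100))
historical, test = split_cases(cases)
neighbours = retrieve_top_k(test[0], historical, toy_similarity)
```

Keeping the test cases out of the retrieval pool means that a retrieved neighbour can never be the case under evaluation.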