Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.

Collaborative Reasoner: Self-Improving Social Agents with Synthetic Conversations

Authors: Ansong Ni, Ruta Desai, Yang Li, Xinjie Lei, Dong Wang, Jiemin Zhang, Jane Yu, Ramya Raghavendra, Gargi Ghosh, Shang-Wen Li, Asli Celikyilmaz

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Through comprehensive evaluation on six collaborative reasoning tasks covering domains of coding, math, scientific QA and social reasoning, we show that current models cannot effectively collaborate due to undesirable social behaviors, collapsing even on problems that they can solve singlehandedly. To improve the collaborative reasoning capabilities of LLMs, we propose a self-play method to generate synthetic multi-turn preference data and further train the language models to be better collaborators. Experiments with Llama-3.1, Ministral and Qwen-2.5 models show that our proposed self-improvement approach consistently outperforms finetuned chain-of-thought performance of the same base model, yielding gains up to 16.7% absolute. Human evaluations show that the models exhibit more effective disagreement and produce more natural conversations after training on our synthetic interaction data.
Researcher Affiliation	Industry	Ansong Ni Ruta Desai Yang Li Xinjie Lei Dong Wang Jiemin Zhang Jane Yu Ramya Raghavendra Gargi Ghosh Daniel Li Asli Celikyilmaz EMAIL
Pseudocode	No	The paper describes the self-training pipeline in Section 4.1 with a diagram (Figure 2) but does not provide explicit pseudocode or algorithm blocks.
Open Source Code	Yes	We open-source our code for Coral and Matrix to support future research on developing social agents that can partner with humans and other agents. Code for Coral exps: https://github.com/facebookresearch/collaborative-reasoner. Code for Matrix infra: https://github.com/facebookresearch/matrix.
Open Datasets	Yes	MATH [16] consists of 12.5K challenging competition-level mathematics problems... MMLU-Pro [47] contains approximately 12k questions from 14 STEM disciplines... GPQA [34] is a graduate-level multiple choice question answering benchmark containing 448 questions... ExploreToM [36] is a theory-of-mind reasoning benchmark... Hi-ToM [15] is a benchmark consisting of 600 examples... MBPP-CR is a code reasoning benchmark adapted from [3]
Dataset Splits	Yes	MATH [16] ... We train with the 7.5k training examples and evaluation on the first 1k test examples; MMLU-Pro [47] ... we re-split the original 12K test data into 10.8K examples for training and 1.2K examples for testing; ExploreToM [36] ... And we split the dataset 10.4K/1.5K/1.5K train/val/test sets. MBPP-CR ... For the train split, we generate 10 solution samples per task, and 2 samples per task for the test split, resulting in 4k training and 1k test examples for MBPP-CR.
Hardware Specification	Yes	All experiments are conducted on AWS p5.48xlarge instances, each with 8x H100 80GiB GPUs.
Software Dependencies	Yes	For both SFT and DPO, we use the fairseq2 [5] and TRL to fully-parameterized train the models for 1,000-3,000 steps with batch size of 20-50.
Experiment Setup	Yes	For both evaluation and synthetic conversation generation, we limit the conversation to be at most 20 turns (i.e., 10 rounds), and end the conversation early when agreement is reached. During tree sampling, we set the turn-level beam size d = 5 and independently sample 5 trees for each problem, and we set sample size = 25 for SFT methods to ensure fair comparison. Subsequently during filtering, we limit at most 2 pairs of preference pairs generated from the same level (i.e., turn) and at most 20 preference pairs generated across all trees for the same problem... For both SFT and DPO, we use the fairseq2 [5] and TRL to fully-parameterized train the models for 1,000-3,000 steps with batch size of 20-50. We limit the sequence length (input + output) to be 8,192
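
The filtering caps quoted in the Experiment Setup row (at most 2 preference pairs per level and at most 20 per problem) can be illustrated with a minimal Python sketch. This is not the authors' code: the PreferencePair type, the filter_pairs function, and the first-come selection order are all hypothetical, and the sketch assumes that "same level" means the same turn within one sampled tree, which the quoted text does not spell out.

```python
from collections import defaultdict
from typing import Iterable, List, NamedTuple

# Caps taken from the quoted setup text.
MAX_PAIRS_PER_LEVEL = 2     # at most 2 pairs from the same level (turn)
MAX_PAIRS_PER_PROBLEM = 20  # at most 20 pairs across all trees of a problem


class PreferencePair(NamedTuple):
    """Hypothetical record for one chosen/rejected pair from tree sampling."""
    problem_id: str
    tree_id: int
    level: int      # turn index inside the sampled tree
    chosen: str
    rejected: str


def filter_pairs(pairs: Iterable[PreferencePair]) -> List[PreferencePair]:
    """Keep pairs in input order while enforcing both caps.

    Assumption: pairs are kept first-come, first-served; the source does
    not say how pairs are prioritized when a cap is exceeded.
    """
    per_level = defaultdict(int)    # (problem, tree, level) -> kept count
    per_problem = defaultdict(int)  # problem -> kept count
    kept = []
    for pair in pairs:
        level_key = (pair.problem_id, pair.tree_id, pair.level)
        if per_problem[pair.problem_id] >= MAX_PAIRS_PER_PROBLEM:
            continue
        if per_level[level_key] >= MAX_PAIRS_PER_LEVEL:
            continue
        per_level[level_key] += 1
        per_problem[pair.problem_id] += 1
        kept.append(pair)
    return kept
```

With 5 trees per problem and up to 20 turns per conversation, the per-level cap alone could still leave far more than 20 pairs for one problem, which is presumably why the per-problem cap is stated separately.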