Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

ACC-Collab: An Actor-Critic Approach to Multi-Agent LLM Collaboration

Authors: Andrew Estornell, Jean-Francois Ton, Yuanshun Yao, Yang Liu

ICLR 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We demonstrate that ACC-Collab outperforms SotA multi-agent techniques on a wide array of benchmarks. 5 EXPERIMENTS
Researcher Affiliation	Collaboration	Andrew Estornell ByteDance Research EMAIL Jean-Francois Ton ByteDance Research EMAIL Yuanshun Yao Meta GenAI EMAIL Yang Liu University of California, Santa Cruz EMAIL
Pseudocode	Yes	Algorithm 1: Trajectory generation and selection. Data: Actor and critic: θa, θc, Distribution of tasks D, Reward threshold ε. Result: A dataset of trajectories D D
Open Source Code	Yes	Code available at https://github.com/LlenRotse/ACC-Collab
Open Datasets	Yes	Benchmarks: To evaluate the efficacy of ACC-Collab we make use of 5 standard benchmark tasks: BoolQ (Clark et al., 2019), 12k yes-no reading comprehension questions; MMLU (Hendrycks et al., 2020), 15k multiple-choice questions covering a wide array of subjects and difficulty; BBH (Suzgun et al., 2022), 5k mixed-type questions; SCIQ (Welbl et al., 2017), 13k multiple-choice science questions; ARC (Chollet, 2019), 7k multiple-choice reasoning-based questions.
Dataset Splits	Yes	Each dataset is split into a training set, a validation set, and a testing set. For datasets that come with an explicit partition of these sets we use the given partitions; this includes BoolQ, MMLU, SCIQ, and ARC. For BBH, we randomly sample roughly 25% and 10% of the questions from each category in BBH to create a test and validation set, respectively; this comes out to 1260 questions for the test set and 500 questions for the validation set. All results are reported on questions in the test set.
Hardware Specification	Yes	Compute: All training was performed on a single Nvidia-H800 GPU. Inference for Llama-3 and Mistral based models is performed on a single Nvidia-V100 GPU; for Gemma-2 based models we used a single Nvidia-H800.
Software Dependencies	No	The paper mentions "VLLM library" and "trl library" but does not specify their version numbers, which are required for a reproducible description of ancillary software.
Experiment Setup	Yes	Training for all models was performed via the trl library, using LoRAs of size 256. When training ACC-Collab with DPO we use a negative log-likelihood (NLL) regularization term (with weight 1) as outlined in Pang et al. (2024a).
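
The Dataset Splits row describes sampling roughly 25% and 10% of the questions from each BBH category to form the test and validation sets. A minimal sketch of that kind of per-category split is shown below. It is an illustration only, not the authors' code: the function name split_by_category, the (category, question) input format, the fixed seed, and the rounding rule are all assumptions, so it will not necessarily reproduce the reported 1260 and 500 question counts.

```python
import random
from collections import defaultdict


def split_by_category(questions, test_frac=0.25, val_frac=0.10, seed=0):
    """Split (category, question) pairs into train/val/test per category.

    Hypothetical helper: the fractions follow the percentages quoted in the
    table, but the rounding and seeding choices are assumptions.
    """
    rng = random.Random(seed)

    # Group questions by their category label.
    by_category = defaultdict(list)
    for category, question in questions:
        by_category[category].append(question)

    train, val, test = [], [], []
    for category in sorted(by_category):
        items = by_category[category][:]
        rng.shuffle(items)  # random sample within each category
        n_test = round(len(items) * test_frac)
        n_val = round(len(items) * val_frac)
        test += [(category, q) for q in items[:n_test]]
        val += [(category, q) for q in items[n_test:n_test + n_val]]
        train += [(category, q) for q in items[n_test + n_val:]]
    return train, val, test


# Toy data: 3 made-up categories with 20 questions each.
toy = [(f"cat{c}", f"q{c}-{i}") for c in range(3) for i in range(20)]
train, val, test = split_by_category(toy)
print(len(train), len(val), len(test))  # 39 6 15
```

Sampling within each category, rather than over the pooled questions, keeps the category mix of the test and validation sets close to that of the full benchmark.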