Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
ACC-Collab: An Actor-Critic Approach to Multi-Agent LLM Collaboration
Authors: Andrew Estornell, Jean-Francois Ton, Yuanshun Yao, Yang Liu
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate that ACC-Collab outperforms SotA multi-agent techniques on a wide array of benchmarks. 5 EXPERIMENTS |
| Researcher Affiliation | Collaboration | Andrew Estornell ByteDance Research EMAIL Jean-Francois Ton ByteDance Research EMAIL Yuanshun Yao Meta GenAI EMAIL Yang Liu University of California, Santa Cruz EMAIL |
| Pseudocode | Yes | Algorithm 1: Trajectory generation and selection Data: Actor and critic: θa, θc, Distribution of tasks D, Reward threshold ε Result: A dataset of trajectories D D |
| Open Source Code | Yes | Code available at https://github.com/LlenRotse/ACC-Collab |
| Open Datasets | Yes | Benchmarks. To evaluate the efficacy of ACC-Collab we make use of 5 standard benchmark tasks: BoolQ Clark et al. (2019) 12k yes-no reading comprehension questions, MMLU Hendrycks et al. (2020) 15k multiple choice questions covering a wide array of subjects and difficulty, BBH Suzgun et al. (2022) 5k mixed-type questions, SCIQ Welbl et al. (2017) 13k multiple-choice science questions, ARC Chollet (2019) 7k multiple-choice reasoning-based questions. |
| Dataset Splits | Yes | Each dataset is split into a training set, a validation set, and a testing set. For datasets that come with an explicit partition of these sets we use the given partitions; this includes BoolQ, MMLU, SCIQ, and ARC. For BBH, we randomly sample roughly 25% and 10% of the questions from each category in BBH to create a test and validation set, respectively; this comes out to 1260 questions for the test set and 500 questions for the validation set. All results are reported on questions in the test set. |
| Hardware Specification | Yes | Compute. All training was performed on a single Nvidia-H800 GPU. Inference for Llama-3 and Mistral based models is performed on a single Nvidia-v100 GPU; for Gemma-2 based models we used a single Nvidia-H800. |
| Software Dependencies | No | The paper mentions "VLLM library" and "trl library" but does not specify their version numbers, which are required for a reproducible description of ancillary software. |
| Experiment Setup | Yes | Training for all models was performed via the trl library, using LoRAs of size 256. When training ACC-Collab with DPO we use a negative log-likelihood (NLL) regularization term (with weight 1) as outlined in Pang et al. (2024a). |
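
The Dataset Splits row describes sampling roughly 25% and 10% of the questions from each BBH category to form the test and validation sets. The sketch below illustrates one way such a per-category split could be done. It is not the authors' code: the function name `split_by_category`, the `(category, question)` input format, the rounding rule, and the seed are all assumptions made for illustration, and the paper does not state how its sampling was seeded or rounded.

```python
import random
from collections import defaultdict


def split_by_category(questions, test_frac=0.25, val_frac=0.10, seed=0):
    """Split (category, question) pairs into train/validation/test sets.

    Sampling is done independently inside each category, so every
    category contributes roughly the same fraction to each split.
    The input format and default seed are hypothetical.
    """
    rng = random.Random(seed)

    # Group questions by their category label.
    by_category = defaultdict(list)
    for category, question in questions:
        by_category[category].append(question)

    splits = {"train": [], "validation": [], "test": []}
    for category in sorted(by_category):
        items = list(by_category[category])
        rng.shuffle(items)

        # Rounding is an assumption; the source only says "roughly".
        n_test = round(len(items) * test_frac)
        n_val = round(len(items) * val_frac)

        splits["test"] += [(category, q) for q in items[:n_test]]
        splits["validation"] += [(category, q) for q in items[n_test:n_test + n_val]]
        splits["train"] += [(category, q) for q in items[n_test + n_val:]]
    return splits


# Toy data: two made-up categories with 20 questions each.
toy = [(f"cat{c}", f"q{c}-{i}") for c in range(2) for i in range(20)]
splits = split_by_category(toy)
print({name: len(items) for name, items in splits.items()})
```

Fixing the seed makes the split repeatable, which matters here because all reported results depend on which questions land in the test set.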