Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
CaPo: Cooperative Plan Optimization for Efficient Embodied Multi-Agent Cooperation
Authors: Jie Liu, Pan Zhou, Yingjun Du, Ah-Hwee Tan, Cees G Snoek, Jan-jakob Sonke, Efstratios Gavves
ICLR 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results on the ThreeDWorld Multi-Agent Transport and Communicative Watch-And-Help tasks demonstrate CaPo's much higher task completion rate and efficiency compared with state-of-the-arts. |
| Researcher Affiliation | Academia | ¹University of Amsterdam, The Netherlands; ²Singapore Management University, Singapore; ³The Netherlands Cancer Institute, The Netherlands; ⁴Archimedes/Athena RC, Greece |
| Pseudocode | No | The paper does not contain explicit pseudocode or algorithm blocks. It describes processes and prompt templates for LLMs, but not structured algorithms for the core methodology. |
| Open Source Code | Yes | The code is released at https://github.com/jliu4ai/CaPo. |
| Open Datasets | Yes | We follow CoELA, and adopt the ThreeDWorld Multi-Agent Transport (TDW-MAT) task (Zhang et al., 2023b), and the Communicative Watch-And-Help (C-WAH) task (Zhang et al., 2023b) to test our CaPo. |
| Dataset Splits | No | The test set of TDW-MAT consists of 24 episodes, which are evenly divided into food and stuff tasks. In C-WAH, ... The test set contains 10 episodes, including both symbolic and visual observation settings. The paper mentions the size of the test sets but does not specify training/validation splits or overall dataset partitioning. |
| Hardware Specification | No | The paper mentions using specific LLMs (GPT3.5-turbo, GPT-4, LLAMA-2-13B-CHAT) but does not provide specific hardware details like GPU/CPU models or memory used for running the experiments or the embodied agents themselves. |
| Software Dependencies | No | The paper mentions using GPT3.5-turbo and GPT-4 from the OpenAI API (OpenAI, 2024), and LLAMA-2-13B-CHAT (Touvron et al., 2023), and Mask-RCNN (He et al., 2017) for perception, but does not provide specific version numbers for underlying software dependencies like programming languages (e.g., Python), frameworks (e.g., PyTorch/TensorFlow), or other libraries. |
| Experiment Setup | Yes | We set default parameters for LLMs: temperature of 0.7, a maximum of 256 output tokens, and top-1 sampling. |
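
For readers attempting to reproduce the setup, the default LLM parameters quoted in the Experiment Setup row could be recorded in a small configuration object along the following lines. This is only a sketch: the class name, field names, and validation helper are hypothetical and do not come from the paper or its repository, and the quoted text does not say which API parameter "top-1 sampling" corresponds to, so the value is stored as stated.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class DecodingDefaults:
    """Hypothetical container; field names are illustrative, not from the paper."""
    temperature: float = 0.7      # stated in the quoted setup
    max_output_tokens: int = 256  # stated in the quoted setup
    top_n: int = 1                # "top-1 sampling" as quoted; API mapping is an open assumption


def validate(cfg: DecodingDefaults) -> None:
    """Basic sanity checks before the settings are used in an experiment."""
    if cfg.temperature < 0:
        raise ValueError("temperature must be non-negative")
    if cfg.max_output_tokens <= 0:
        raise ValueError("max_output_tokens must be positive")
    if cfg.top_n < 1:
        raise ValueError("top_n must be at least 1")


defaults = DecodingDefaults()
validate(defaults)
settings = asdict(defaults)  # plain dict, convenient for logging alongside results
```

Logging `settings` with each run keeps the decoding parameters attached to the reported numbers, which is the kind of detail the Experiment Setup variable checks for.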