Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

CaPo: Cooperative Plan Optimization for Efficient Embodied Multi-Agent Cooperation

Authors: Jie Liu, Pan Zhou, Yingjun Du, Ah-Hwee Tan, Cees G Snoek, Jan-jakob Sonke, Efstratios Gavves

ICLR 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Experimental results on the ThreeDWorld Multi-Agent Transport and Communicative Watch-And-Help tasks demonstrate CaPo's much higher task completion rate and efficiency compared with state-of-the-arts.
Researcher Affiliation	Academia	(1) University of Amsterdam, The Netherlands; (2) Singapore Management University, Singapore; (3) The Netherlands Cancer Institute, The Netherlands; (4) Archimedes/Athena RC, Greece
Pseudocode	No	The paper does not contain explicit pseudocode or algorithm blocks. It describes processes and prompt templates for LLMs, but not structured algorithms for the core methodology.
Open Source Code	Yes	The code is released at https://github.com/jliu4ai/CaPo.
Open Datasets	Yes	We follow CoELA, and adopt the ThreeDWorld Multi-Agent Transport (TDW-MAT) task (Zhang et al., 2023b), and the Communicative Watch-And-Help (C-WAH) task (Zhang et al., 2023b) to test our CaPo.
Dataset Splits	No	The test set of TDW-MAT consists 24 episodes, which evenly divided into food and stuff tasks. In C-WAH, ... The test set contains 10 episodes, including both symbolic and visual observation settings. The paper mentions the size of the test sets but does not specify training/validation splits or overall dataset partitioning.
Hardware Specification	No	The paper mentions using specific LLMs (GPT3.5-turbo, GPT-4, LLAMA-2-13B-CHAT) but does not provide specific hardware details like GPU/CPU models or memory used for running the experiments or the embodied agents themselves.
Software Dependencies	No	The paper mentions using GPT3.5-turbo and GPT-4 from the OpenAI API (OpenAI, 2024), and LLAMA-2-13B-CHAT (Touvron et al., 2023), and Mask-RCNN (He et al., 2017) for perception, but does not provide specific version numbers for underlying software dependencies like programming languages (e.g., Python), frameworks (e.g., PyTorch/TensorFlow), or other libraries.
Experiment Setup	Yes	We set default parameters for LLMs: temperature of 0.7, a maximum of 256 output tokens, and top-1 sampling.
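
For anyone attempting to reproduce the runs, the LLM settings quoted in the Experiment Setup row could be recorded in a small configuration object. The following is only a sketch: the class and field names (LLMDecodingConfig, max_output_tokens, sampling_note) are hypothetical and do not come from the paper or its repository, and "top-1 sampling" is kept as a plain label because the row does not say which API parameter it corresponds to.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class LLMDecodingConfig:
    """Decoding settings as quoted in the Experiment Setup row (names are hypothetical)."""

    temperature: float = 0.7      # reported default temperature
    max_output_tokens: int = 256  # reported maximum number of output tokens
    # "top-1 sampling" is reported without further detail. How it maps to a
    # concrete API parameter is not stated, so it is stored as a label only.
    sampling_note: str = "top-1"


DEFAULT_CONFIG = LLMDecodingConfig()

# Plain-dict form, convenient for logging alongside experiment results.
config_dict = asdict(DEFAULT_CONFIG)
```

Logging such a dictionary with each run keeps the reported defaults and any deviations from them visible in one place.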