Language Grounded Multi-agent Reinforcement Learning with Human-interpretable Communication

Authors: Huao Li, Hossein Nourkhiz Mahjoub, Behdad Chalaki, Vaishnav Tadiparthi, Kwonjoon Lee, Ehsan Moradi-Pari, Michael Lewis, Katia Sycara

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our results demonstrate that introducing language grounding not only maintains task performance but also accelerates the emergence of communication. Furthermore, the learned communication protocols exhibit zero-shot generalization capabilities in ad-hoc teamwork scenarios with unseen teammates and novel task states.
Researcher Affiliation | Collaboration | Huao Li (University of Pittsburgh, hul52@pitt.edu); Hossein Nourkhiz Mahjoub (Honda Research Institute USA, Inc., hossein_nourkhizmahjoub@honda-ri.com); Behdad Chalaki (Honda Research Institute USA, Inc., behdad_chalaki@honda-ri.com); Vaishnav Tadiparthi (Honda Research Institute USA, Inc., vaishnav_tadiparthi@honda-ri.com); Kwonjoon Lee (Honda Research Institute USA, Inc., kwonjoon_lee@honda-ri.com); Ehsan Moradi-Pari (Honda Research Institute USA, Inc., emoradipari@honda-ri.com); Michael Lewis (University of Pittsburgh, ml@sis.pitt.edu); Katia Sycara (Carnegie Mellon University, sycara@andrew.cmu.edu)
Pseudocode | No | The paper describes the computational pipeline and methods in prose, but does not include structured pseudocode or algorithm blocks.
Open Source Code | Yes | Code available at https://romanlee6.github.io/langground_web/.
Open Datasets | No | In order to construct dataset D, we collected expert trajectories from embodied LLM agents powered by GPT-4 in interactive task scenarios.
Dataset Splits | Yes | The y-axis is task performance measured by the episode length until task completion, which is lower the better. The x-axis is the number of training timestamps. Shaded areas are standard errors over three random seeds.
Hardware Specification | Yes | All experiments were conducted on a machine with a 14-core Intel(R) Core(TM) i9-12900H CPU and 64GB memory.
Software Dependencies | Yes | We use OpenAI's API to call gpt-4-0125-preview as the backbone pre-trained model and set the temperature parameter to 0 to ensure consistent outputs. (An illustrative API call sketch follows the table.)
Experiment Setup | Yes | The batch size is 500, and the number of update iterations in an epoch is 10. Training on ppv0 and USAR takes 2000 epochs and 1e7 timestamps, which takes about 4 hours to complete. Training on ppv1 takes 500 epochs and 2.5e6 timestamps, which takes about 1.5 hours to complete. We use a learning rate of 0.0001 for USAR and 0.001 for pp. The MARL agent's action policy is an LSTM with a hidden layer of size 256. Communication vectors are exchanged one round at each timestamp. The supervised learning weight λ is 1 in pp and 10 in USAR.
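The hyperparameters quoted in the Experiment Setup row can be restated as a single configuration sketch. The snippet below is illustrative only: the class name TrainConfig, its field names, and the CONFIGS dictionary are ours, not taken from the released code; it simply collects the reported per-task values.

```python
from dataclasses import dataclass


@dataclass
class TrainConfig:
    """Hypothetical container restating the hyperparameters reported in the paper."""
    epochs: int
    timesteps: float
    learning_rate: float
    sl_weight: float               # supervised learning weight (lambda)
    batch_size: int = 500          # batch size reported for all tasks
    updates_per_epoch: int = 10    # update iterations within an epoch
    lstm_hidden_size: int = 256    # hidden size of the LSTM action policy


# Per-task values as reported; keys follow the task names used in the paper.
CONFIGS = {
    "ppv0": TrainConfig(epochs=2000, timesteps=1e7, learning_rate=1e-3, sl_weight=1),
    "ppv1": TrainConfig(epochs=500, timesteps=2.5e6, learning_rate=1e-3, sl_weight=1),
    "USAR": TrainConfig(epochs=2000, timesteps=1e7, learning_rate=1e-4, sl_weight=10),
}
```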
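The Software Dependencies row names gpt-4-0125-preview called with temperature 0 through OpenAI's API. Below is a minimal sketch of such a call using the official openai Python client; the prompt text is a placeholder of our own, not the paper's actual expert-agent prompt.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Placeholder prompt: the paper's actual prompts are not reproduced here.
response = client.chat.completions.create(
    model="gpt-4-0125-preview",  # backbone pre-trained model named in the paper
    temperature=0,               # temperature 0 to ensure consistent outputs
    messages=[
        {"role": "system", "content": "You are an embodied agent collaborating on a team task."},
        {"role": "user", "content": "Report your next action and a short message to your teammate."},
    ],
)
print(response.choices[0].message.content)
```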