Language Grounded Multi-agent Reinforcement Learning with Human-interpretable Communication
Authors: Huao Li, Hossein Nourkhiz Mahjoub, Behdad Chalaki, Vaishnav Tadiparthi, Kwonjoon Lee, Ehsan Moradi-Pari, Michael Lewis, Katia Sycara
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our results demonstrate that introducing language grounding not only maintains task performance but also accelerates the emergence of communication. Furthermore, the learned communication protocols exhibit zero-shot generalization capabilities in ad-hoc teamwork scenarios with unseen teammates and novel task states. |
| Researcher Affiliation | Collaboration | Huao Li (University of Pittsburgh) hul52@pitt.edu; Hossein Nourkhiz Mahjoub (Honda Research Institute USA, Inc.) hossein_nourkhizmahjoub@honda-ri.com; Behdad Chalaki (Honda Research Institute USA, Inc.) behdad_chalaki@honda-ri.com; Vaishnav Tadiparthi (Honda Research Institute USA, Inc.) vaishnav_tadiparthi@honda-ri.com; Kwonjoon Lee (Honda Research Institute USA, Inc.) kwonjoon_lee@honda-ri.com; Ehsan Moradi-Pari (Honda Research Institute USA, Inc.) emoradipari@honda-ri.com; Michael Lewis (University of Pittsburgh) ml@sis.pitt.edu; Katia Sycara (Carnegie Mellon University) sycara@andrew.cmu.edu |
| Pseudocode | No | The paper describes the computational pipeline and methods in prose, but does not include structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code available at https://romanlee6.github.io/langground_web/. |
| Open Datasets | No | In order to construct dataset D, we collected expert trajectories from embodied LLM agents powered by GPT-4 in interactive task scenarios. |
| Dataset Splits | Yes | The y-axis is task performance measured by the episode length until task completion (lower is better). The x-axis is the number of training timesteps. Shaded areas are standard errors over three random seeds. |
| Hardware Specification | Yes | All experiments were conducted on a machine with a 14-core Intel(R) Core(TM) i9-12900H CPU and 64GB memory. |
| Software Dependencies | Yes | We use OpenAI's API to call gpt-4-0125-preview as the backbone pre-trained model and set the temperature parameter to 0 to ensure consistent outputs. (See the illustrative API-call sketch after this table.) |
| Experiment Setup | Yes | The batch size is 500, and the number of update iterations in an epoch is 10. Training on ppv0 and USAR takes 2000 epochs and 1e7 timesteps (about 4 hours); training on ppv1 takes 500 epochs and 2.5e6 timesteps (about 1.5 hours). We use a learning rate of 0.0001 for USAR and 0.001 for pp. The MARL agent's action policy is an LSTM with a hidden layer of size 256. Communication vectors are exchanged one round at each timestep. The supervised learning weight λ is 1 in pp and 10 in USAR. (See the configuration sketch after this table.) |
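The software-dependency and open-dataset rows report that expert trajectories were collected from embodied LLM agents by calling OpenAI's gpt-4-0125-preview at temperature 0. A minimal sketch of such a call, assuming the current `openai` Python client; the function name `query_expert_agent` and the prompt handling are illustrative and not taken from the released code:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def query_expert_agent(observation_prompt: str) -> str:
    """Ask the backbone LLM for an expert response to one task observation,
    using the model name and temperature reported in the paper."""
    response = client.chat.completions.create(
        model="gpt-4-0125-preview",  # backbone model named in the paper
        temperature=0,               # temperature 0 for consistent outputs
        messages=[{"role": "user", "content": observation_prompt}],
    )
    return response.choices[0].message.content
```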
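The experiment-setup row can likewise be summarized as a single configuration sketch. All values below are as reported; the dictionary structure and key names are illustrative and do not mirror the released codebase:

```python
# Reported hyperparameters, grouped per environment.
# "lambda_sl" is the supervised learning weight λ; "pp" covers ppv0/ppv1.
CONFIG = {
    "batch_size": 500,
    "updates_per_epoch": 10,
    "policy": {"type": "LSTM", "hidden_size": 256},
    "comm_rounds_per_timestep": 1,
    "envs": {
        "ppv0": {"epochs": 2000, "timesteps": 1e7,   "lr": 1e-3, "lambda_sl": 1},
        "ppv1": {"epochs": 500,  "timesteps": 2.5e6, "lr": 1e-3, "lambda_sl": 1},
        "USAR": {"epochs": 2000, "timesteps": 1e7,   "lr": 1e-4, "lambda_sl": 10},
    },
}
```

Reported wall-clock times on the listed hardware (14-core i9-12900H, 64GB memory): roughly 4 hours for ppv0/USAR and 1.5 hours for ppv1.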