Offline Multi-Agent Reinforcement Learning with Knowledge Distillation

Authors: Wei-Cheng Tseng, Tsun-Hsuan Johnson Wang, Yen-Chen Lin, Phillip Isola

NeurIPS 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We execute a series of experiments to evaluate whether the proposed method is effective at solving offline MARL problems. Specifically, our experiments seek to answer the following questions: first, does our method perform favorably against a wide range of existing approaches based on sequence modeling, imitation learning, and offline reinforcement learning (Section 4.1)? Specifically, we compare our method with model-free offline MARL methods based on TD-learning and MADT, a concurrent work that also solves MARL via sequence modeling. Next, we conduct careful ablation studies to evaluate the contribution of each component within our framework (Section 4.2). To test the sample efficiency of various approaches, we further show the performance of various methods when different numbers of demonstrations are provided (Section 4.4). Finally, we analyze the convergence rate (Section 4.5) and discuss the scalability (Section 5) of our method.
Researcher Affiliation | Academia | Wei-Cheng Tseng¹, Tsun-Hsuan Wang², Lin Yen-Chen², Phillip Isola²; ¹University of Toronto, ²MIT CSAIL; weicheng.tseng@mail.utoronto.ca, {tsunw,yenchenl,phillipi}@mit.edu
Pseudocode | Yes | Algorithm 1: Our Offline MARL
Open Source Code | No | The paper refers to a project website (https://weichengtseng.github.io/project_website/neurips22/index.html) for qualitative results, but does not explicitly state that the source code for their methodology is released or available at this link or elsewhere.
Open Datasets | Yes | The offline datasets for SMAC [35] are collected by running a policy trained with MAPPO [46]. We also provide more information about datasets in appendix C. [35] Mikayel Samvelyan, Tabish Rashid, Christian Schroeder de Witt, Gregory Farquhar, Nantas Nardelli, Tim G. J. Rudner, Chia-Man Hung, Philip H. S. Torr, Jakob Foerster, and Shimon Whiteson. The StarCraft Multi-Agent Challenge. CoRR, abs/1902.04043, 2019. (A hedged sketch of this kind of behavior-policy rollout appears after the table.)
Dataset Splits | No | The paper mentions collecting offline datasets and details training parameters in Appendix A, but it does not specify explicit training, validation, and test splits (e.g., percentages or sample counts) for reproducibility.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., CPU/GPU models, memory, or cloud instance types) used for running the experiments.
Software Dependencies | No | The paper mentions using PPO [36] and MAPPO [46] for data collection and various baselines, but it does not provide specific version numbers for software dependencies or libraries used in the experiments.
Experiment Setup | Yes | In summary, the overall learning objective for agent i is $L^i_{\text{total}} = L^i_{\text{action}} + \alpha L^i_{\text{rel}} + \beta L^i_{\text{KL}}$ (Eq. 4), where $\alpha$ and $\beta$ are hyperparameters that determine the importance of the proposed policy distillation. ... We empirically found that setting $e = 4$ improves the convergence speed, but the converged performance is insensitive to $e$. ... For detailed experimental settings of our approach and the baselines, please refer to appendix A.
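
The combined objective quoted in the Experiment Setup row is simple to reproduce once the three per-agent terms are computed. Below is a minimal PyTorch-style sketch of Eq. 4; the function name, argument names, and the placeholder values of alpha and beta are illustrative assumptions, not the authors' released implementation.

```python
import torch

def per_agent_total_loss(action_loss: torch.Tensor,
                         rel_loss: torch.Tensor,
                         kl_loss: torch.Tensor,
                         alpha: float,
                         beta: float) -> torch.Tensor:
    """Eq. 4 for a single agent i:
    L_total^i = L_action^i + alpha * L_rel^i + beta * L_KL^i,
    where alpha and beta weight the two policy-distillation terms."""
    return action_loss + alpha * rel_loss + beta * kl_loss

# Hypothetical usage over n agents (the alpha/beta values below are
# placeholders, not hyperparameters reported in the paper):
# total = torch.stack([
#     per_agent_total_loss(a, r, k, alpha=1.0, beta=1.0)
#     for a, r, k in zip(action_losses, rel_losses, kl_losses)
# ]).sum()
# total.backward()
```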
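
For the Open Datasets row, the paper states only that the SMAC offline data are collected by rolling out a policy trained with MAPPO. The following is a generic, hypothetical sketch of that kind of behavior-policy collection loop; the `env` and `behavior_policy` interfaces and the stored transition format are assumptions for illustration, not the authors' pipeline or the exact SMAC API.

```python
from typing import Any, Dict, List

def collect_offline_dataset(env: Any,
                            behavior_policy: Any,
                            num_episodes: int) -> List[List[Dict]]:
    """Roll out a fixed (e.g., MAPPO-trained) behavior policy and record
    per-step multi-agent transitions. `env` and `behavior_policy` are
    hypothetical interfaces: env.reset() -> per-agent observations,
    env.step(actions) -> (next_obs, rewards, done),
    behavior_policy.act(obs) -> one action per agent."""
    dataset: List[List[Dict]] = []
    for _ in range(num_episodes):
        episode: List[Dict] = []
        obs = env.reset()
        done = False
        while not done:
            actions = behavior_policy.act(obs)
            next_obs, rewards, done = env.step(actions)
            episode.append({
                "obs": obs,
                "actions": actions,
                "rewards": rewards,
                "done": done,
            })
            obs = next_obs
        dataset.append(episode)
    return dataset
```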