DGPO: Discovering Multiple Strategies with Diversity-Guided Policy Optimization

Authors: Wentse Chen, Shiyu Huang, Yuan Chiang, Tim Pearce, Wei-Wei Tu, Ting Chen, Jun Zhu

AAAI 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results show that our method efficiently discovers diverse strategies in a wide variety of reinforcement learning tasks. Compared to baseline methods, DGPO achieves comparable rewards while discovering more diverse strategies, often with better sample efficiency. We evaluate our algorithm on several RL benchmarks: the Multi-agent Particle Environment (MPE) (Mordatch and Abbeel 2018), the StarCraft Multi-Agent Challenge (SMAC) (Samvelyan et al. 2019), and Atari (Bellemare et al. 2013). We compare our algorithm to four baseline algorithms: ... We performed ablation studies on MPE Spread (hard) tasks, systematically removing each element of our algorithm to assess its impact on diversity.
Researcher Affiliation | Collaboration | Wentse Chen (1), Shiyu Huang (2), Yuan Chiang (3), Tim Pearce (4), Wei-Wei Tu (2), Ting Chen (3), Jun Zhu (3). Affiliations: (1) Carnegie Mellon University, Pittsburgh, USA; (2) 4Paradigm Inc., Beijing, China; (3) Tsinghua University, Beijing, China; (4) Microsoft Research, Cambridge, United Kingdom.
Pseudocode | No | The paper describes the algorithm in text and provides a diagram (Figure 2), but no formal pseudocode block or algorithm box.
Open Source Code | No | The paper mentions that RSPO, a baseline, uses an "open-source implementation", but provides no statement or link to open-source code for the proposed DGPO method.
Open Datasets | Yes | We evaluate our algorithm on several RL benchmarks: the Multi-agent Particle Environment (MPE) (Mordatch and Abbeel 2018), the StarCraft Multi-Agent Challenge (SMAC) (Samvelyan et al. 2019), and Atari (Bellemare et al. 2013).
Dataset Splits | No | The paper mentions training and testing but does not provide specific details on validation dataset splits or methodology.
Hardware Specification | Yes | All experiments were performed on a machine with 128 GB RAM, one 32-core CPU, and one GeForce RTX 3090 GPU.
Software Dependencies | No | The paper mentions using PPO and TD loss, but does not provide specific version numbers for any software libraries, frameworks, or languages used for implementation.
Experiment Setup | Yes | We set n_z = 4 in Spread (easy) and n_z = 2 in Spread (hard) to test whether an algorithm can discover all optimal solutions. ... In each environment, we set n_z = 2. ... We set n_z = 3 and measure the mean win rates over five seeds. ... (Full hyperparameters are listed in Appendix B.)
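
Since the Experiment Setup row above reports the diversity settings only in prose (with full hyperparameter tables deferred to Appendix B) and no DGPO code is released, the following is a minimal sketch of how those reported settings could be organized for a re-implementation. Every identifier is hypothetical; only the n_z values and the five-seed SMAC protocol come from the quoted text.

# Hypothetical configuration for a DGPO re-implementation; not released code.
# Only the n_z values and the five-seed SMAC evaluation are taken from the paper's text.

NUM_SEEDS_SMAC = 5  # "measure the mean win rates over five seeds"

EXPERIMENTS = {
    # MPE: n_z chosen to match the number of optimal solutions to be discovered.
    "mpe_spread_easy": {"benchmark": "MPE", "n_z": 4},
    "mpe_spread_hard": {"benchmark": "MPE", "n_z": 2},
    # Atari: n_z = 2 in each environment.
    "atari": {"benchmark": "Atari", "n_z": 2},
    # SMAC: n_z = 3, evaluated as mean win rate over five seeds.
    "smac": {"benchmark": "SMAC", "n_z": 3, "num_seeds": NUM_SEEDS_SMAC},
}

def describe(name: str) -> str:
    """Return a one-line summary of a configured run (placeholder driver)."""
    cfg = EXPERIMENTS[name]
    return f"DGPO on {cfg['benchmark']} with n_z = {cfg['n_z']}"

if __name__ == "__main__":
    print(describe("mpe_spread_hard"))  # -> DGPO on MPE with n_z = 2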
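
As a companion to the Open Datasets row, here is a minimal sketch of instantiating the three benchmarks through commonly used open-source packages. The package choices (PettingZoo for MPE, the smac package for SMAC, Gym's Atari environments) are assumptions for illustration; the paper does not state which wrappers or versions the authors used.

# Assumed open-source entry points for the three benchmarks; the authors'
# exact environment wrappers and versions are not specified in the paper.

from pettingzoo.mpe import simple_spread_v3  # MPE Spread (module name varies by PettingZoo version)
from smac.env import StarCraft2Env           # SMAC (requires a StarCraft II installation)
import gym                                   # Atari via gym[atari] and ALE ROMs

mpe_env = simple_spread_v3.parallel_env()
smac_env = StarCraft2Env(map_name="3m")         # "3m" is an illustrative map choice
atari_env = gym.make("BreakoutNoFrameskip-v4")  # illustrative Atari game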