DGPO: Discovering Multiple Strategies with Diversity-Guided Policy Optimization

Authors: Wentse Chen, Shiyu Huang, Yuan Chiang, Tim Pearce, Wei-Wei Tu, Ting Chen, Jun Zhu

AAAI 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results show that our method efficiently discovers diverse strategies in a wide variety of reinforcement learning tasks. Compared to baseline methods, DGPO achieves comparable rewards while discovering more diverse strategies, often with better sample efficiency. We evaluate our algorithm on several RL benchmarks: the Multi-agent Particle Environment (MPE) (Mordatch and Abbeel 2018), the StarCraft Multi-Agent Challenge (SMAC) (Samvelyan et al. 2019), and Atari (Bellemare et al. 2013). We compare our algorithm to four baseline algorithms: ... We performed ablation studies on MPE Spread (hard) tasks, systematically removing each element of our algorithm to assess its impact on diversity.
Researcher Affiliation | Collaboration | Wentse Chen (1), Shiyu Huang (2), Yuan Chiang (3), Tim Pearce (4), Wei-Wei Tu (2), Ting Chen (3), Jun Zhu (3). Affiliations: (1) Carnegie Mellon University, Pittsburgh, USA; (2) 4Paradigm Inc., Beijing, China; (3) Tsinghua University, Beijing, China; (4) Microsoft Research, Cambridge, United Kingdom.
Pseudocode | No | The paper describes the algorithm in text and provides a diagram (Figure 2), but no formal pseudocode block or algorithm box.
Open Source Code | No | The paper mentions that RSPO, a baseline, uses an "open-source implementation", but provides no statement or link to open-source code for the proposed DGPO method.
Open Datasets | Yes | We evaluate our algorithm on several RL benchmarks: the Multi-agent Particle Environment (MPE) (Mordatch and Abbeel 2018), the StarCraft Multi-Agent Challenge (SMAC) (Samvelyan et al. 2019), and Atari (Bellemare et al. 2013).
Dataset Splits | No | The paper mentions training and testing but does not provide specific details on validation dataset splits or methodology.
Hardware Specification | Yes | All experiments were performed on a machine with 128 GB RAM, one 32-core CPU, and one GeForce RTX 3090 GPU.
Software Dependencies | No | The paper mentions using PPO and TD loss, but does not provide specific version numbers for any software libraries, frameworks, or languages used for implementation.
Experiment Setup | Yes | We set n_z = 4 in Spread (easy) and n_z = 2 in Spread (hard) to test whether an algorithm can discover all optimal solutions. ... In each environment, we set n_z = 2. ... We set n_z = 3 and measure the mean win rates over five seeds. ... (Full hyperparameters are listed in Appendix B.)
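
Since the Experiment Setup row above reports the diversity settings only in prose (with full hyperparameter tables deferred to Appendix B) and no DGPO code is released, the following is a minimal sketch of how those reported settings could be organized for a re-implementation. Every identifier is hypothetical; only the n_z values and the five-seed SMAC protocol come from the quoted text.

# Hypothetical configuration for a DGPO re-implementation; not released code.
# Only the n_z values and the five-seed SMAC evaluation are taken from the paper's text.

NUM_SEEDS_SMAC = 5  # "measure the mean win rates over five seeds"

EXPERIMENTS = {
    # MPE: n_z chosen to match the number of optimal solutions to be discovered.
    "mpe_spread_easy": {"benchmark": "MPE", "n_z": 4},
    "mpe_spread_hard": {"benchmark": "MPE", "n_z": 2},
    # Atari: n_z = 2 in each environment.
    "atari": {"benchmark": "Atari", "n_z": 2},
    # SMAC: n_z = 3, evaluated as mean win rate over five seeds.
    "smac": {"benchmark": "SMAC", "n_z": 3, "num_seeds": NUM_SEEDS_SMAC},
}

def describe(name: str) -> str:
    """Return a one-line summary of a configured run (placeholder driver)."""
    cfg = EXPERIMENTS[name]
    return f"DGPO on {cfg['benchmark']} with n_z = {cfg['n_z']}"

if __name__ == "__main__":
    print(describe("mpe_spread_hard"))  # -> DGPO on MPE with n_z = 2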
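
As a companion to the Open Datasets row, here is a minimal sketch of instantiating the three benchmarks through commonly used open-source packages. The package choices (PettingZoo for MPE, the smac package for SMAC, Gym's Atari environments) are assumptions for illustration; the paper does not state which wrappers or versions the authors used.

# Assumed open-source entry points for the three benchmarks; the authors'
# exact environment wrappers and versions are not specified in the paper.

from pettingzoo.mpe import simple_spread_v3  # MPE Spread (module name varies by PettingZoo version)
from smac.env import StarCraft2Env           # SMAC (requires a StarCraft II installation)
import gym                                   # Atari via gym[atari] and ALE ROMs

mpe_env = simple_spread_v3.parallel_env()
smac_env = StarCraft2Env(map_name="3m")         # "3m" is an illustrative map choice
atari_env = gym.make("BreakoutNoFrameskip-v4")  # illustrative Atari game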