Continuously Discovering Novel Strategies via Reward-Switching Policy Optimization

Authors: Zihan Zhou, Wei Fu, Bingliang Zhang, Yi Wu

ICLR 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments show that RSPO is able to discover a wide spectrum of strategies in a variety of domains, ranging from single-agent navigation tasks and MuJoCo control to multi-agent stag-hunt games and the StarCraft II Multi-Agent Challenge.
Researcher Affiliation | Academia | Zihan Zhou (1), Wei Fu (2), Bingliang Zhang (2), Yi Wu (2,3); affiliations: (1) CS Department, University of Toronto; (2) IIIS, Tsinghua University; (3) Shanghai Qi Zhi Institute
Pseudocode | No | No pseudocode or algorithm block was explicitly labeled or presented in a structured format.
Open Source Code | No | The paper provides a link to GIF demonstrations, but no explicit statement or link to open-source code for the proposed RSPO method was found.
Open Datasets | Yes | The implementation of 4-Goals is based on Multi-Agent Particle Environments (Mordatch & Abbeel, 2018). We use the MuJoCo environments from Gym (version 0.17.3). The StarCraft II Multi-Agent Challenge (SMAC) (Rashid et al., 2019). (An illustrative environment-loading sketch follows the table.)
Dataset Splits | No | No specific training/validation/test dataset splits (e.g., percentages, sample counts, or citations to predefined splits) were explicitly provided.
Hardware Specification | Yes | Our implementation is based on PPO (Schulman et al., 2017) on a desktop machine with one CPU and one NVIDIA RTX3090 GPU.
Software Dependencies | No | The paper mentions Gym (version 0.17.3) but does not provide version numbers for other key software components, such as the programming language (e.g., Python), the deep learning framework (e.g., PyTorch, TensorFlow), or other libraries. (A hypothetical dependency listing follows the table.)
Experiment Setup | Yes | The PPO hyperparameters used for each experiment are shown in Table 11. There are three additional hyperparameters for RSPO: the automatic threshold coefficient α mentioned in Section 3.5, the weight of the behavior-driven intrinsic reward λ^int_B, and the weight of the reward-driven intrinsic reward λ^int_R. These hyperparameters are shown in Table 12. (A hypothetical configuration sketch follows the table.)
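
To make the Open Datasets row concrete, the following is a minimal sketch of instantiating the environment families the paper names. It assumes gym==0.17.3 (the only version the paper states) and the standard smac package; the Gym environment ID and the SMAC map name are illustrative choices, and the custom MPE-based 4-Goals task is not reproduced here.

```python
# Minimal sketch: loading the benchmark families named in the paper.
# The environment ID and map name below are illustrative assumptions.

import gym                           # gym==0.17.3, per the paper
from smac.env import StarCraft2Env   # SMAC (Rashid et al., 2019)

# A MuJoCo control task from Gym (requires mujoco-py; this task choice is ours).
mujoco_env = gym.make("HalfCheetah-v3")
obs = mujoco_env.reset()

# A StarCraft II Multi-Agent Challenge scenario; the map name is an assumption.
smac_env = StarCraft2Env(map_name="2m_vs_1z")
smac_env.reset()
env_info = smac_env.get_env_info()   # number of agents, action space size, etc.
print(env_info["n_agents"], mujoco_env.action_space)
```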
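
Because only the Gym version is pinned, a reproduction environment can only be partially specified. The listing below is hypothetical: every entry except gym==0.17.3 is an assumption about what a PPO-based implementation of this kind would typically require.

```
# requirements.txt (hypothetical; only the gym pin is stated in the paper)
gym==0.17.3
mujoco-py      # needed for Gym's MuJoCo environments (version not stated)
smac           # StarCraft II Multi-Agent Challenge wrapper (version not stated)
torch          # assumed deep-learning framework; the paper does not name one
numpy
```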
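
The Experiment Setup row names three RSPO-specific hyperparameters on top of the standard PPO ones. The sketch below shows one hypothetical way to group them into a single configuration object; the field names mirror the quantities quoted above, while every default value is a placeholder rather than a value from Tables 11 or 12.

```python
# Hypothetical configuration sketch. All defaults are placeholders, not the
# paper's values from Tables 11-12.

from dataclasses import dataclass


@dataclass
class RSPOConfig:
    # Standard PPO hyperparameters (Table 11 in the paper; values assumed).
    learning_rate: float = 3e-4
    gamma: float = 0.99
    gae_lambda: float = 0.95
    clip_ratio: float = 0.2
    ppo_epochs: int = 10

    # RSPO-specific hyperparameters (Table 12 in the paper; values assumed).
    alpha: float = 1.0           # automatic threshold coefficient (Section 3.5)
    lambda_int_B: float = 0.1    # weight of the behavior-driven intrinsic reward
    lambda_int_R: float = 0.1    # weight of the reward-driven intrinsic reward


config = RSPOConfig()            # defaults here are placeholders only
```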