Continuously Discovering Novel Strategies via Reward-Switching Policy Optimization

Authors: Zihan Zhou, Wei Fu, Bingliang Zhang, Yi Wu

ICLR 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments show that RSPO is able to discover a wide spectrum of strategies in a variety of domains, ranging from single-agent navigation tasks and MuJoCo control to multi-agent stag-hunt games and the StarCraft II Multi-Agent Challenge.
Researcher Affiliation | Academia | Zihan Zhou (1), Wei Fu (2), Bingliang Zhang (2), Yi Wu (2,3); affiliations: (1) CS Department, University of Toronto; (2) IIIS, Tsinghua University; (3) Shanghai Qi Zhi Institute
Pseudocode | No | No pseudocode or algorithm block was explicitly labeled or presented in a structured format.
Open Source Code | No | The paper provides a link to GIF demonstrations, but no explicit statement or link to open-source code for the proposed RSPO method was found.
Open Datasets | Yes | The implementation of 4-Goals is based on Multi-Agent Particle Environments (Mordatch & Abbeel, 2018). We use the MuJoCo environments from Gym (version 0.17.3). The StarCraft II Multi-Agent Challenge (SMAC) (Rashid et al., 2019). (An illustrative environment-loading sketch follows the table.)
Dataset Splits | No | No specific training/validation/test dataset splits (e.g., percentages, sample counts, or citations to predefined splits) were explicitly provided.
Hardware Specification | Yes | Our implementation is based on PPO (Schulman et al., 2017) on a desktop machine with one CPU and one NVIDIA RTX3090 GPU.
Software Dependencies | No | The paper mentions Gym (version 0.17.3) but does not provide version numbers for other key software components, such as the programming language (e.g., Python), the deep learning framework (e.g., PyTorch, TensorFlow), or other libraries. (A hypothetical dependency listing follows the table.)
Experiment Setup | Yes | The PPO hyperparameters used for each experiment are shown in Table 11. There are three additional hyperparameters for RSPO: the automatic threshold coefficient α mentioned in Section 3.5, the weight of the behavior-driven intrinsic reward λ^int_B, and the weight of the reward-driven intrinsic reward λ^int_R. These hyperparameters are shown in Table 12. (A hypothetical configuration sketch follows the table.)
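
To make the Open Datasets row concrete, the following is a minimal sketch of instantiating the environment families the paper names. It assumes gym==0.17.3 (the only version the paper states) and the standard smac package; the Gym environment ID and the SMAC map name are illustrative choices, and the custom MPE-based 4-Goals task is not reproduced here.

```python
# Minimal sketch: loading the benchmark families named in the paper.
# The environment ID and map name below are illustrative assumptions.

import gym                           # gym==0.17.3, per the paper
from smac.env import StarCraft2Env   # SMAC (Rashid et al., 2019)

# A MuJoCo control task from Gym (requires mujoco-py; this task choice is ours).
mujoco_env = gym.make("HalfCheetah-v3")
obs = mujoco_env.reset()

# A StarCraft II Multi-Agent Challenge scenario; the map name is an assumption.
smac_env = StarCraft2Env(map_name="2m_vs_1z")
smac_env.reset()
env_info = smac_env.get_env_info()   # number of agents, action space size, etc.
print(env_info["n_agents"], mujoco_env.action_space)
```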
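
Because only the Gym version is pinned, a reproduction environment can only be partially specified. The listing below is hypothetical: every entry except gym==0.17.3 is an assumption about what a PPO-based implementation of this kind would typically require.

```
# requirements.txt (hypothetical; only the gym pin is stated in the paper)
gym==0.17.3
mujoco-py      # needed for Gym's MuJoCo environments (version not stated)
smac           # StarCraft II Multi-Agent Challenge wrapper (version not stated)
torch          # assumed deep-learning framework; the paper does not name one
numpy
```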
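
The Experiment Setup row names three RSPO-specific hyperparameters on top of the standard PPO ones. The sketch below shows one hypothetical way to group them into a single configuration object; the field names mirror the quantities quoted above, while every default value is a placeholder rather than a value from Tables 11 or 12.

```python
# Hypothetical configuration sketch. All defaults are placeholders, not the
# paper's values from Tables 11-12.

from dataclasses import dataclass


@dataclass
class RSPOConfig:
    # Standard PPO hyperparameters (Table 11 in the paper; values assumed).
    learning_rate: float = 3e-4
    gamma: float = 0.99
    gae_lambda: float = 0.95
    clip_ratio: float = 0.2
    ppo_epochs: int = 10

    # RSPO-specific hyperparameters (Table 12 in the paper; values assumed).
    alpha: float = 1.0           # automatic threshold coefficient (Section 3.5)
    lambda_int_B: float = 0.1    # weight of the behavior-driven intrinsic reward
    lambda_int_R: float = 0.1    # weight of the reward-driven intrinsic reward


config = RSPOConfig()            # defaults here are placeholders only
```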