Robust Reinforcement Learning as a Stackelberg Game via Adaptively-Regularized Adversarial Training

Authors: Peide Huang, Mengdi Xu, Fei Fang, Ding Zhao

IJCAI 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We conduct experiments to answer the following questions: (Q1) Does our method produce challenging yet solvable environments? (Q2) Does our method improve the robustness and training stability? (Q3) How does the choice of α influence the performance of the protagonist?" and "Our algorithm demonstrates better training stability and robustness against different testing conditions in the single-agent robotics control and multi-agent highway merging tasks."
Researcher Affiliation | Academia | Carnegie Mellon University; {peideh, mengdixu}@andrew.cmu.edu, feif@cs.cmu.edu, dingzhao@cmu.edu
Pseudocode | Yes | Algorithm 1: Stackelberg Policy Gradient (Stack-PG); Algorithm 2: Multi Policy Gradient; Algorithm 3: Solving RRL-Stack with Stack-PG (an illustrative alternating-update sketch follows the table).
Open Source Code | No | The paper does not provide an explicit statement about releasing open-source code for the described methodology or a direct link to a code repository. The footnote links to their arXiv paper, not source code.
Open Datasets | Yes | "We modify the Lunar Lander environment in Open AI Gym [Brockman et al., 2016] to simulate the effects of actuation delay", alongside the highway merging tasks of [Leurent, 2018] (see the actuation-delay wrapper sketch after the table).
Dataset Splits | No | The paper does not provide specific dataset split information (e.g., percentages or sample counts for training, validation, and test sets) for reproducibility. It discusses training and evaluation within the RL environment context, but not data splits for a static dataset.
Hardware Specification | No | The paper does not provide any specific hardware details such as CPU/GPU models, processor types, or memory amounts used for running the experiments.
Software Dependencies | No | The paper mentions "Open AI Gym" and "Adam optimizer" but does not provide specific version numbers for these or other software dependencies.
Experiment Setup | Yes | "Without specific mention, the policies are parametrized by MLPs with two hidden layers. All the agents are trained using policy gradient algorithms with Adam optimizer and the same learning rate. In the experiments, we use a ratio of 1:3 to alternate between the updates of the protagonist and the adversary." (A minimal policy/optimizer sketch matching this description appears below.)
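
The experiment-setup row reports two-hidden-layer MLP policies trained with policy-gradient methods and the Adam optimizer at a shared learning rate. A minimal PyTorch sketch consistent with that description follows; the hidden width, learning rate, Tanh activations, and the Categorical action head (sized for the discrete Lunar Lander task) are assumptions, as the quoted text does not specify them.

```python
import torch
import torch.nn as nn


class MLPPolicy(nn.Module):
    """Two-hidden-layer MLP policy over a discrete action space (illustrative sketch)."""

    def __init__(self, obs_dim: int, n_actions: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, obs: torch.Tensor) -> torch.distributions.Categorical:
        # Return a categorical distribution over actions given an observation batch.
        return torch.distributions.Categorical(logits=self.net(obs))


# Protagonist and adversary policies (sizes here match discrete Lunar Lander and are assumed).
protagonist = MLPPolicy(obs_dim=8, n_actions=4)
adversary = MLPPolicy(obs_dim=8, n_actions=4)
opt_pro = torch.optim.Adam(protagonist.parameters(), lr=3e-4)
opt_adv = torch.optim.Adam(adversary.parameters(), lr=3e-4)  # same learning rate for both agents
```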
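The paper presents Algorithms 1-3 only as pseudocode. The toy sketch below illustrates just the alternation pattern described in the experiment setup, one protagonist (leader) update followed by three adversary (follower) updates, using REINFORCE on a hypothetical one-step two-player game; the payoff matrix, the zero-sum adversary objective, and all hyperparameters are assumptions and not the authors' Stack-PG implementation.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)


def play(pro_logits, adv_logits, payoff):
    """Sample one joint action of a one-step game; the protagonist receives payoff[a, b]."""
    pro_dist = torch.distributions.Categorical(logits=pro_logits)
    adv_dist = torch.distributions.Categorical(logits=adv_logits)
    a, b = pro_dist.sample(), adv_dist.sample()
    return payoff[a, b], pro_dist.log_prob(a), adv_dist.log_prob(b)


# Toy protagonist payoff matrix (assumed); the adversary is treated as zero-sum here.
payoff = torch.tensor([[1.0, -1.0], [-1.0, 1.0]])
pro_logits = nn.Parameter(torch.zeros(2))
adv_logits = nn.Parameter(torch.zeros(2))
opt_pro = torch.optim.Adam([pro_logits], lr=0.05)
opt_adv = torch.optim.Adam([adv_logits], lr=0.05)  # same learning rate for both agents

for _ in range(200):
    # One protagonist update ...
    r, logp_pro, _ = play(pro_logits, adv_logits, payoff)
    opt_pro.zero_grad()
    (-r.detach() * logp_pro).backward()      # REINFORCE: ascend the protagonist's return
    opt_pro.step()

    # ... followed by three adversary updates (the 1:3 ratio quoted above).
    for _ in range(3):
        r, _, logp_adv = play(pro_logits, adv_logits, payoff)
        opt_adv.zero_grad()
        (r.detach() * logp_adv).backward()   # adversary descends the protagonist's return
        opt_adv.step()
```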
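The open-datasets row quotes a Lunar Lander modification that simulates actuation delay, but the paper does not release that modification. A rough illustration of how such a delay could be added with a Gym wrapper is below; the delay length, the no-op placeholder action, and the classic Gym step API are assumptions.

```python
import collections

import gym


class ActuationDelayWrapper(gym.Wrapper):
    """Executes each action `delay` environment steps after it is issued (illustrative)."""

    def __init__(self, env, delay: int = 2, noop_action=0):
        super().__init__(env)
        self.delay = delay
        self.noop_action = noop_action  # assumed placeholder while the buffer fills
        self._queue = collections.deque()

    def reset(self, **kwargs):
        self._queue.clear()
        return self.env.reset(**kwargs)

    def step(self, action):
        # Queue the incoming action and execute the one issued `delay` steps earlier.
        self._queue.append(action)
        if len(self._queue) <= self.delay:
            executed = self.noop_action
        else:
            executed = self._queue.popleft()
        return self.env.step(executed)


# Usage (assumed): discrete Lunar Lander, where action 0 is "do nothing".
# env = ActuationDelayWrapper(gym.make("LunarLander-v2"), delay=2)
```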