Robust Reinforcement Learning as a Stackelberg Game via Adaptively-Regularized Adversarial Training
Authors: Peide Huang, Mengdi Xu, Fei Fang, Ding Zhao
IJCAI 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct experiments to answer the following questions: (Q1) Does our method produce challenging yet solvable environments? (Q2) Does our method improve the robustness and training stability? (Q3) How does the choice of α influence the performance of the protagonist? Our algorithm demonstrates better training stability and robustness against different testing conditions in the single-agent robotics control and multi-agent highway merging tasks. |
| Researcher Affiliation | Academia | Carnegie Mellon University {peideh, mengdixu}@andrew.cmu.edu, feif@cs.cmu.edu, dingzhao@cmu.edu |
| Pseudocode | Yes | Algorithm 1: Stackelberg Policy Gradient (Stack-PG), Algorithm 2: Multi Policy Gradient, Algorithm 3: Solving RRL-Stack with Stack-PG (a hedged training-loop sketch follows the table) |
| Open Source Code | No | The paper does not provide an explicit statement about releasing open-source code for the described methodology or a direct link to a code repository. The footnote links to their arXiv paper, not source code. |
| Open Datasets | Yes | We modify the Lunar Lander environment in OpenAI Gym [Brockman et al., 2016] to simulate the effects of actuation delay; the highway merging tasks are built on the environment from [Leurent, 2018] (an illustrative actuation-delay wrapper sketch follows the table). |
| Dataset Splits | No | The paper does not provide specific dataset split information (e.g., percentages or sample counts for training, validation, and test sets) for reproducibility. It discusses training and evaluation within the RL environment context, but not data splits for a static dataset. |
| Hardware Specification | No | The paper does not provide any specific hardware details such as CPU/GPU models, processor types, or memory amounts used for running the experiments. |
| Software Dependencies | No | The paper mentions OpenAI Gym and the Adam optimizer but does not provide specific version numbers for these or other software dependencies. |
| Experiment Setup | Yes | Unless otherwise specified, the policies are parameterized by MLPs with two hidden layers. All the agents are trained using policy gradient algorithms with the Adam optimizer and the same learning rate. In the experiments, we use a 1:3 ratio to alternate between the updates of the protagonist and the adversary (see the training-loop sketch below the table). |
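
The sketch below illustrates the training procedure suggested by the pseudocode and experiment-setup rows: two-hidden-layer MLP policies, Adam with a shared learning rate, and a 1:3 protagonist-to-adversary update ratio. It is a minimal sketch, not the authors' Stack-PG implementation; the toy reward functions, the `mlp_policy` and `reinforce_step` helpers, and all hyperparameter values are assumptions made for illustration.

```python
# Hedged sketch of an alternating protagonist/adversary policy-gradient loop
# (cf. Algorithm 3, "Solving RRL-Stack with Stack-PG"). All specifics below
# are assumptions, not the authors' code.
import torch
import torch.nn as nn


def mlp_policy(obs_dim: int, n_actions: int, hidden: int = 64) -> nn.Module:
    """Two-hidden-layer MLP producing action logits (hidden size assumed)."""
    return nn.Sequential(
        nn.Linear(obs_dim, hidden), nn.Tanh(),
        nn.Linear(hidden, hidden), nn.Tanh(),
        nn.Linear(hidden, n_actions),
    )


def reinforce_step(policy, optimizer, obs, reward_fn):
    """One REINFORCE-style update on a single toy transition."""
    dist = torch.distributions.Categorical(logits=policy(obs))
    action = dist.sample()
    reward = reward_fn(action)
    loss = -dist.log_prob(action) * reward  # ascend the expected reward
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()


obs_dim, n_actions, lr = 4, 3, 3e-4
protagonist = mlp_policy(obs_dim, n_actions)
adversary = mlp_policy(obs_dim, n_actions)
# Same learning rate for both players, as stated in the experiment setup.
opt_p = torch.optim.Adam(protagonist.parameters(), lr=lr)
opt_a = torch.optim.Adam(adversary.parameters(), lr=lr)

for it in range(100):
    obs = torch.randn(obs_dim)  # placeholder observation (toy stand-in)
    if it % 4 == 0:
        # Leader (protagonist) update: 1 out of every 4 iterations.
        reinforce_step(protagonist, opt_p, obs, lambda a: float(a == 0))
    else:
        # Follower (adversary) updates: 3 out of every 4 iterations.
        reinforce_step(adversary, opt_a, obs, lambda a: float(a != 0))
```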
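
The Lunar Lander modification is described only at a high level ("simulate the effects of actuation delay"). One plausible way to reproduce such a variant is a standard Gym wrapper that buffers actions; the `ActuationDelayWrapper` class, the `delay` parameter, and the no-op fallback below are assumptions for illustration, not the authors' exact modification.

```python
# Hedged sketch of an actuation-delay variant of Lunar Lander via a Gym
# wrapper. The buffering scheme is an assumption, not the paper's code.
from collections import deque

import gym


class ActuationDelayWrapper(gym.Wrapper):
    """Delays each agent action by `delay` environment steps."""

    def __init__(self, env: gym.Env, delay: int = 2):
        super().__init__(env)
        self.delay = delay
        self._queue = deque()

    def reset(self, **kwargs):
        self._queue.clear()
        return self.env.reset(**kwargs)

    def step(self, action):
        self._queue.append(action)
        if len(self._queue) > self.delay:
            applied = self._queue.popleft()
        else:
            # Until the buffer fills, apply a no-op (action 0 is "do nothing"
            # in the discrete Lunar Lander action space).
            applied = 0
        return self.env.step(applied)


# Usage sketch: env = ActuationDelayWrapper(gym.make("LunarLander-v2"), delay=2)
```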