Robust Reinforcement Learning via Adversarial training with Langevin Dynamics

Authors: Parameswaran Kamalaruban, Yu-Ting Huang, Ya-Ping Hsieh, Paul Rolland, Cheng Shi, Volkan Cevher

NeurIPS 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our algorithm consistently outperforms existing baselines, in terms of generalization across different training and testing conditions, on several MuJoCo environments. Our experiments also show that, even for objective functions that entirely ignore potential environmental shifts, our sampling approach remains highly robust in comparison to standard RL algorithms.
Researcher Affiliation | Academia | Parameswaran Kamalaruban (The Alan Turing Institute, kparameswaran@turing.ac.uk); Yu-Ting Huang (EPFL, yu.huang@epfl.ch); Ya-Ping Hsieh (LIONS, EPFL, ya-ping.hsieh@epfl.ch); Paul Rolland (LIONS, EPFL, paul.rolland@epfl.ch); Cheng Shi (University of Basel, cheng.shi@unibas.ch); Volkan Cevher (LIONS, EPFL, volkan.cevher@epfl.ch)
Pseudocode | Yes | Algorithm 1 (Mixed NE-LD)
Open Source Code | No | The paper references 'OpenAI baselines. https://github.com/openai/baselines, 2017' [39], which is a third-party tool, but it does not provide an explicit statement or link for the source code of its own methodology.
Open Datasets | Yes | We evaluate the performance of Algorithm 3 and Algorithm 4 (with GAD and Extra-Adam) on standard continuous control benchmarks available on OpenAI Gym [29] utilizing the MuJoCo environment [30]. (See the environment sketch after this table.)
Dataset Splits | No | The paper mentions training on 0.5M samples and evaluation at test time, but it does not provide specific details on dataset splits (e.g., percentages, sample counts, or a clear methodology for training, validation, and test sets).
Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., exact GPU/CPU models, memory, or cloud instance specifications) used to run the experiments.
Software Dependencies | No | The paper mentions optimizers such as Adam [31] and RMSProp, and algorithms such as DDPG [7], but it does not provide specific version numbers for any software dependencies required for replication.
Experiment Setup | Yes | For all the algorithms, we use a two-layer feedforward neural network structure of (64, 64, tanh) for both actors (agent and adversary) and the critic. The optimizer we use to update the critic is Adam [31] with a learning rate of 10^-3. The target networks are soft-updated with τ = 0.999. For the GAD baseline, the actors are trained with the RMSProp optimizer. For our algorithm (Mixed NE-LD), the actors are updated according to Algorithm 1 with warmup steps K_t = min(15, (1 + 10^-5)^t) and thermal noise σ_t = σ_0 (1 - 5 × 10^-5)^t.
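
The hyperparameter schedules quoted in the Experiment Setup row are easy to misread from the extracted text, so the following is a minimal sketch of how they evolve, paired with a generic stochastic gradient Langevin dynamics (SGLD) style noisy update. The function names, the learning rate, the value of sigma_0, and the update rule itself are illustrative assumptions; this is not the paper's Algorithm 1 (Mixed NE-LD), only a standard SGLD step of the kind such algorithms build on.

import math
import numpy as np

def warmup_steps(t: int) -> int:
    # K_t = min(15, (1 + 1e-5)^t), as quoted in the Experiment Setup row
    return int(min(15, (1 + 1e-5) ** t))

def thermal_noise(t: int, sigma0: float = 0.01) -> float:
    # sigma_t = sigma_0 * (1 - 5e-5)^t; sigma_0 = 0.01 is an assumed value, not from the paper
    return sigma0 * (1 - 5e-5) ** t

def sgld_step(theta: np.ndarray, grad: np.ndarray, lr: float, sigma: float) -> np.ndarray:
    # Generic SGLD-style noisy ascent: gradient step plus Gaussian thermal noise.
    # This is a standard SGLD update for illustration, not the paper's exact actor update.
    noise = sigma * math.sqrt(2 * lr) * np.random.randn(*theta.shape)
    return theta + lr * grad + noise

# Inspect the two schedules at a few outer iterations.
for t in (0, 10_000, 100_000, 500_000):
    print(t, warmup_steps(t), thermal_noise(t))

Only the two schedules come from the quoted setup; everything else in the sketch is a stand-in.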
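
For the benchmarks mentioned in the Open Datasets row, a minimal sketch of instantiating one of the OpenAI Gym MuJoCo tasks is shown below. The environment ID, the pre-0.26 Gym API, and the random-action rollout are assumptions for illustration; the paper's own training loop and environment list are not reproduced here.

import gym  # assumes gym with mujoco-py installed (pre-0.26 API)

env = gym.make("HalfCheetah-v2")  # assumed example task; the paper evaluates several MuJoCo environments
obs = env.reset()
cumulative_reward = 0.0
for _ in range(1000):
    action = env.action_space.sample()  # random actions stand in for the trained agent/adversary pair
    obs, reward, done, info = env.step(action)
    cumulative_reward += reward
    if done:
        obs = env.reset()
env.close()
print("cumulative reward over 1000 random steps:", cumulative_reward)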