Breaking the Barrier: Enhanced Utility and Robustness in Smoothed DRL Agents

Authors: Chung-En Sun, Sicun Gao, Tsui-Wei Weng

ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate the performance of S-DQN and S-PPO in terms of both robust reward and robustness guarantee. Our proposed agents not only achieve high clean rewards but also provide robustness certification, setting new state-of-the-art across various standard RL environments, including Atari games (Mnih et al., 2013) and continuous control tasks (Brockman et al., 2016). In our DQN settings, the evaluations are done in three Atari environments: Pong, Freeway, and Road Runner. In our PPO settings, the evaluations are done on two continuous control tasks in the Mujoco environments: Walker and Hopper.
Researcher Affiliation | Academia | UC San Diego. Correspondence to: Chung-En Sun <cesun@ucsd.edu>, Tsui-Wei Weng <lweng@ucsd.edu>.
Pseudocode | Yes | Appendix A.1 (Detailed algorithms of S-DQN; A.1.1 Training algorithm of S-DQN) provides Algorithm 1 (Train S-DQN), Algorithm 2 (Test S-DQN), and Algorithm 3 (Smoothed Attack, S-PGD). Appendix A.2 (Detailed algorithms of S-PPO; A.2.1 Training algorithm of S-PPO) provides Algorithm 4 (Train S-PPO) and Algorithm 5 (Collect Trajectories function). An illustrative randomized-smoothing sketch based on these algorithm names appears after this table.
Open Source Code | Yes | Our code is available at: https://github.com/Trustworthy-ML-Lab/Robust_High_Util_Smoothed_DRL
Open Datasets | Yes | We follow the previous robust DRL literature to conduct experiments on Atari (Mnih et al., 2013) and Mujoco (Brockman et al., 2016) benchmarks. In our DQN settings, the evaluations are done in three Atari environments: Pong, Freeway, and Road Runner. In our PPO settings, the evaluations are done on two continuous control tasks in the Mujoco environments: Walker and Hopper.
Dataset Splits | No | The paper mentions training on Atari and Mujoco environments and reports median performance across 15 runs, but it does not specify explicit training, validation, or test dataset splits (e.g., percentages or sample counts for each split).
Hardware Specification | No | The paper mentions that "The training time of S-DQN is roughly 12 hours on our hardware" but does not provide any specific details about the hardware, such as CPU or GPU models, memory, or cloud instance types.
Software Dependencies | No | The paper states: "Our DQN implementation is based on the SA-DQN (Zhang et al., 2020) and CROP (Wu et al., 2022)." and "Our PPO implementation is based on the SA-PPO (Zhang et al., 2020), Radial PPO (Oikarinen et al., 2021), ATLA-PPO (Zhang et al., 2021), and PA-ATLA-PPO (Sun et al., 2022)." While it mentions software implementations/frameworks, it does not provide specific version numbers for any software dependencies (e.g., Python, PyTorch, TensorFlow versions).
Experiment Setup | Yes | We train our S-DQN for 300,000 frames in Pong, Freeway, and Road Runner. The smoothing variance σ for S-DQN is set to 0.1 in Pong, 0.1 in Freeway, and 0.05 in Road Runner. We train S-PPO for 2,000,000 steps in Walker and Hopper. The smoothing variance σ for S-PPO is set to 0.2 in all environments. The ℓ∞ attack budget for all the attacks for PPO (MAD, Min-RS, Optimal Attack, PA-AD attack) is set to 0.075. All the experiment results under attack are obtained by taking the average of 5 episodes (for DQN) and 50 episodes (for PPO). These reported values are collected into an illustrative configuration sketch after this table.
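
To make the quoted experiment setup easier to scan, the dictionary below collects the reported hyperparameters in one place. The structure and key names are illustrative only and are not taken from the paper's released code.

# Reported hyperparameters from the experiment setup; key names are illustrative.
EXPERIMENT_SETUP = {
    "S-DQN": {
        "training_frames": 300_000,
        "smoothing_sigma": {"Pong": 0.1, "Freeway": 0.1, "RoadRunner": 0.05},
        "episodes_per_attack_evaluation": 5,
    },
    "S-PPO": {
        "training_steps": 2_000_000,
        "smoothing_sigma": {"Walker": 0.2, "Hopper": 0.2},
        "linf_attack_budget": 0.075,  # applies to MAD, Min-RS, Optimal Attack, PA-AD
        "episodes_per_attack_evaluation": 50,
    },
}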
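
For readers unfamiliar with smoothed agents, the following is a minimal sketch of the randomized-smoothing idea behind the "Test S-DQN" algorithm listed in the Pseudocode row: actions are chosen from Q-values averaged over Gaussian-perturbed observations. The names (q_network, smoothed_action, n_samples) are hypothetical stand-ins, not the authors' implementation; only the σ values come from the quoted experiment setup.

import numpy as np

def smoothed_action(q_network, obs, sigma=0.1, n_samples=100, rng=None):
    """Select an action from Q-values averaged over Gaussian-perturbed observations.

    q_network: hypothetical callable mapping an observation array to a vector of
        Q-values (one entry per action).
    sigma: smoothing standard deviation, e.g. 0.1 for Pong/Freeway or 0.05 for
        Road Runner, per the quoted experiment setup.
    """
    rng = np.random.default_rng() if rng is None else rng
    q_sum = None
    for _ in range(n_samples):
        noisy_obs = obs + rng.normal(0.0, sigma, size=obs.shape)
        q = np.asarray(q_network(noisy_obs), dtype=np.float64)
        q_sum = q if q_sum is None else q_sum + q
    return int(np.argmax(q_sum / n_samples))

# Toy usage: a random linear "Q-network" over a 4-dimensional observation.
_rng = np.random.default_rng(0)
_W = _rng.normal(size=(6, 4))  # 6 actions, 4 observation dimensions
print(smoothed_action(lambda s: _W @ s, _rng.normal(size=4), sigma=0.1, n_samples=50, rng=_rng))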