Breaking the Barrier: Enhanced Utility and Robustness in Smoothed DRL Agents
Authors: Chung-En Sun, Sicun Gao, Tsui-Wei Weng
ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate the performance of S-DQN and S-PPO in terms of both robust reward and robustness guarantee. Our proposed agents not only achieve high clean rewards but also provide robustness certification, setting new state-of-the-art across various standard RL environments, including Atari games (Mnih et al., 2013) and continuous control tasks (Brockman et al., 2016). In our DQN settings, the evaluations are done in three Atari environments: Pong, Freeway, and Road Runner. In our PPO settings, the evaluations are done on two continuous control tasks in the Mujoco environments: Walker and Hopper. |
| Researcher Affiliation | Academia | UC San Diego. Correspondence to: Chung-En Sun <cesun@ucsd.edu>, Tsui-Wei Weng <lweng@ucsd.edu>. |
| Pseudocode | Yes | Appendix A.1, Detailed algorithms of S-DQN (A.1.1, Training algorithm of S-DQN): Algorithm 1, Train S-DQN; Algorithm 2, Test S-DQN; Algorithm 3, Smoothed Attack (S-PGD). Appendix A.2, Detailed algorithms of S-PPO (A.2.1, Training algorithm of S-PPO): Algorithm 4, Train S-PPO; Algorithm 5, Collect Trajectories function. A generic smoothing sketch motivated by these algorithm names is given after the table. |
| Open Source Code | Yes | Our code is available at: https://github.com/Trustworthy-ML-Lab/Robust_HighUtil_Smoothed_DRL |
| Open Datasets | Yes | We follow the previous robust DRL literature to conduct experiments on Atari (Mnih et al., 2013) and Mujoco (Brockman et al., 2016) benchmarks. In our DQN settings, the evaluations are done in three Atari environments: Pong, Freeway, and Road Runner. In our PPO settings, the evaluations are done on two continuous control tasks in the Mujoco environments: Walker and Hopper. |
| Dataset Splits | No | The paper mentions training on Atari and Mujoco environments and reports median performance across 15 runs, but it does not specify explicit training, validation, or test dataset splits (e.g., percentages or sample counts for each split). |
| Hardware Specification | No | The paper mentions that "The training time of S-DQN is roughly 12 hours on our hardware" but does not provide any specific details about the hardware, such as CPU or GPU models, memory, or cloud instance types. |
| Software Dependencies | No | The paper states: "Our DQN implementation is based on the SA-DQN (Zhang et al., 2020) and CROP (Wu et al., 2022)." and "Our PPO implementation is based on the SA-PPO (Zhang et al., 2020), RADIAL-PPO (Oikarinen et al., 2021), ATLA-PPO (Zhang et al., 2021), and PA-ATLA-PPO (Sun et al., 2022)." While it names these reference implementations, it does not provide version numbers for any software dependencies (e.g., Python, PyTorch, or TensorFlow versions). |
| Experiment Setup | Yes | We train our S-DQN for 300,000 frames in Pong, Freeway, and Road Runner. The smoothing variance σ for S-DQN is set to 0.1 in Pong, 0.1 in Freeway, and 0.05 in Road Runner. We train S-PPO for 2,000,000 steps in Walker and Hopper. The smoothing variance σ for S-PPO is set to 0.2 in all environments. The ℓ∞ attack budget for all the attacks for PPO (MAD, Min-RS, Optimal Attack, PA-AD attack) is set to 0.075. All the experiment results under attack are obtained by taking the average of 5 episodes (for DQN) and 50 episodes (for PPO). These values are collected into a hedged configuration sketch after the table. |
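
The pseudocode row above covers training, testing, and attacking smoothed agents. As a point of reference only, below is a minimal sketch of the general randomized-smoothing idea behind test-time action selection: average Q-values over Gaussian-perturbed copies of an observation and act greedily. The function name `smoothed_q_action`, the mean aggregation, and the default sample count are assumptions for illustration, not the authors' Algorithm 2.

```python
import torch

def smoothed_q_action(q_network, obs, sigma, num_samples=100):
    """Hypothetical sketch: greedy action from a Gaussian-smoothed Q-function.

    q_network: maps a batch of observations to a (batch, num_actions) tensor.
    obs: a single observation tensor; sigma: smoothing standard deviation.
    """
    obs = obs.unsqueeze(0)                                 # (1, *obs_shape)
    noise = torch.randn(num_samples, *obs.shape[1:]) * sigma
    noisy_obs = obs + noise                                 # (num_samples, *obs_shape)
    with torch.no_grad():
        q_values = q_network(noisy_obs)                     # (num_samples, num_actions)
    smoothed_q = q_values.mean(dim=0)                       # Monte Carlo estimate of E[Q(s + noise, .)]
    return smoothed_q.argmax().item()
```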
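
The Experiment Setup row lists the key hyperparameters. The sketch below collects them into a single configuration dictionary for readability; the key names are illustrative assumptions, while the numeric values come from the quoted text.

```python
# Hedged summary of the quoted setup; key names are illustrative, not from the paper's code.
EXPERIMENT_CONFIG = {
    "s_dqn": {
        "train_frames": 300_000,                      # Pong, Freeway, Road Runner
        "sigma": {"Pong": 0.1, "Freeway": 0.1, "RoadRunner": 0.05},
        "attack_eval_episodes": 5,                    # results under attack averaged over 5 episodes
    },
    "s_ppo": {
        "train_steps": 2_000_000,                     # Walker, Hopper
        "sigma": {"Walker": 0.2, "Hopper": 0.2},
        "linf_attack_budget": 0.075,                  # MAD, Min-RS, Optimal Attack, PA-AD
        "attack_eval_episodes": 50,                   # results under attack averaged over 50 episodes
    },
}
```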