Towards Optimal Adversarial Robust Q-learning with Bellman Infinity-error
Authors: Haoran Li, Zicheng Zhang, Wang Luo, Congying Han, Yudong Hu, Tiande Guo, Shichen Liao
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section, we conduct extensive comparison and ablation experiments to validate the rationality of our theoretical analysis and the effectiveness of CAR-DQN. |
| Researcher Affiliation | Academia | 1School of Mathematical Sciences, University of Chinese Academy of Sciences, Beijing, China. Correspondence to: Congying Han <hancy@ucas.ac.cn>. |
| Pseudocode | Yes | Algorithm 1 Consistent Adversarial Robust Deep Q-Learning (CAR-DQN). |
| Open Source Code | Yes | Our code is available at https://github.com/leoranlmia/CAR-DQN. |
| Open Datasets | Yes | We conduct experiments on four Atari video games (Brockman et al., 2016), including Pong, Freeway, Bank Heist, and Road Runner. |
| Dataset Splits | No | The paper describes data preprocessing (e.g., 'pre-process the input images into 84×84 grayscale images and normalize the pixel values to the range [0, 1]'), but does not explicitly state training, validation, or test dataset splits. A sketch of this preprocessing step appears after the table. |
| Hardware Specification | No | The paper states 'All these models are trained for 4.5 million frames on identical hardware' but does not provide specific details such as GPU/CPU models, memory, or other hardware specifications. |
| Software Dependencies | No | The paper mentions software components like 'Double Dueling DQN' and 'Adam optimizer', but does not provide specific version numbers for any software, libraries, or frameworks used. |
| Experiment Setup | Yes | We implement CAR-DQN based on Double Dueling DQN (Van Hasselt et al., 2016; Wang et al., 2016) and train all baselines and CAR-DQN for 4.5 million steps... We update the target network every 2000 steps, and set learning rate as 1.25 × 10⁻⁴, batch size as 32, exploration ϵ_exp-end as 0.01, soft coefficient λ = 1.0 and discount factor as 0.99. We use a replay buffer with a capacity of 2 × 10⁵ and Adam optimizer (Kingma & Ba, 2014) with β1 = 0.9 and β2 = 0.999. (These values are gathered in the configuration sketch after the table.) |
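For reference, below is a minimal sketch of the image preprocessing quoted in the Dataset Splits row (84×84 grayscale images with pixel values scaled to [0, 1]). The function name and the use of OpenCV are assumptions; the paper's exact preprocessing code is not given in the excerpt.

```python
import cv2
import numpy as np


def preprocess_frame(frame: np.ndarray) -> np.ndarray:
    """Convert an RGB Atari frame to an 84x84 grayscale image with values in [0, 1].

    Hypothetical helper mirroring the preprocessing described in the paper;
    `frame` is expected to be an (H, W, 3) uint8 array from the emulator.
    """
    gray = cv2.cvtColor(frame, cv2.COLOR_RGB2GRAY)                       # drop color channels
    resized = cv2.resize(gray, (84, 84), interpolation=cv2.INTER_AREA)   # downscale to 84x84
    return resized.astype(np.float32) / 255.0                            # normalize to [0, 1]
```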
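The training hyperparameters reported in the Experiment Setup row can be collected into a single configuration. The sketch below assumes a PyTorch implementation; the `q_network` is a stand-in placeholder, since the Double Dueling DQN architecture and the CAR-DQN loss are not defined in the excerpt.

```python
import torch

# Hyperparameters as reported in the paper's experiment setup.
CONFIG = dict(
    total_steps=4_500_000,        # training steps for all baselines and CAR-DQN
    target_update_interval=2000,  # target network sync period
    learning_rate=1.25e-4,
    batch_size=32,
    exploration_eps_end=0.01,     # final epsilon for epsilon-greedy exploration
    soft_coefficient=1.0,         # lambda, the soft coefficient
    discount=0.99,                # gamma
    replay_capacity=200_000,      # 2 x 10^5 transitions
)

# Stand-in module for illustration only; the paper uses a Double Dueling DQN.
q_network = torch.nn.Linear(4, 2)
optimizer = torch.optim.Adam(
    q_network.parameters(),
    lr=CONFIG["learning_rate"],
    betas=(0.9, 0.999),  # beta1 = 0.9, beta2 = 0.999 as reported
)
```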