Belief-Enriched Pessimistic Q-Learning against Adversarial State Perturbations

Authors: Xiaolin Sun, Zizhan Zheng

ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this section, we evaluate our belief-enriched pessimistic DQN algorithms by conducting experiments on three environments: a continuous-state Gridworld environment (shown in Figure 1a) for BP-DQN, and two Atari games, Pong and Freeway, for DP-DQN-O and DP-DQN-F... Our method achieves high robustness and significantly outperforms existing solutions under strong attacks while maintaining comparable performance under relatively weak attacks. Further, its training complexity is comparable to SA-MDP and WocaR-RL and is much lower than alternating training-based approaches.
Researcher Affiliation | Academia | Xiaolin Sun, Department of Computer Science, Tulane University, New Orleans, LA 70118, xsun12@tulane.edu; Zizhan Zheng, Department of Computer Science, Tulane University, New Orleans, LA 70118, zzheng3@tulane.edu
Pseudocode | Yes | Algorithm 1: Pessimistic Q-Learning; Algorithm 2: Belief Update; Algorithm 3: Pessimistic Q-Iteration; Algorithm 4: Belief-Enriched Pessimistic DQN (BP-DQN) Training; Algorithm 5: Belief-Enriched Pessimistic DQN (BP-DQN) Testing; Algorithm 6: Diffusion-Assisted Pessimistic DQN (DP-DQN) Training; Algorithm 7: Diffusion-Assisted Pessimistic DQN (DP-DQN) Testing
Open Source Code | Yes | Our code is available at https://github.com/SliencerX/Belief-enriched-robust-Q-learning.
Open Datasets | Yes | For Atari games, we choose Pong and Freeway provided by OpenAI Gym (Brockman et al., 2016).
Dataset Splits | No | The paper mentions training for a certain number of frames (e.g., '1 million frames for the continuous Gridworld environment and 6 million frames for the Atari games') and testing in '10 randomized environments', but it does not specify explicit training, validation, or test dataset splits (e.g., percentages or counts).
Hardware Specification | Yes | All training and testing are done on a machine equipped with an i9-12900KF CPU and a single RTX 3090 GPU.
Software Dependencies | No | The paper mentions 'OpenAI Gym' and refers to models like 'DDPM' and 'Progressive Distillation', but it does not specify version numbers for any programming languages, libraries, or software dependencies required for reproduction (e.g., PyTorch, TensorFlow, or Gym versions).
Experiment Setup | Yes | We set all parameters as default in their papers when training both SA-DQN and WocaR-DQN. For training our pessimistic DQN algorithm with PF-RNN-based belief (called BP-DQN, see Algorithm 4 in Appendix E), we set κp = |Mt| = 30, i.e., the PF-RNN model will generate 30 belief states in each time step. For training our pessimistic DQN algorithm with diffusion (called DP-DQN, see Algorithm 6 in Appendix E), we set κd = |Mt| = 4, that is, the diffusion model generates 4 purified belief states from a perturbed state. For DP-DQN-O, we set the number of reverse steps to k = 10 for ϵ = 1/255 or 3/255 and k = 30 for ϵ = 15/255, and do not add noise ϕ when training and testing DP-DQN-O. For DP-DQN-F, we set k = 1 and the sampler steps to 64, and add random noise with ϵϕ = 5/255 when training DP-DQN-F. We sample C = 30 trajectories to train the PF-RNN and diffusion models. All other parameters are set as default for training the PF-RNN and diffusion models. For all other baselines, we train for 1 million frames for the continuous Gridworld environment and 6 million frames for the Atari games. For our methods, we take the pre-trained vanilla DQN model and train our method for another 1 million frames.
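
The Pseudocode row above lists the paper's algorithms without reproducing them. As an illustration only, the sketch below shows a maximin ("pessimistic") action rule of the kind those algorithms appear to build on: the agent acts greedily with respect to the worst-case Q-value over a set of belief states (30 PF-RNN samples for BP-DQN, 4 diffusion-purified states for DP-DQN). The function and variable names are ours, not the authors'; the actual procedures are Algorithms 1-7 in the paper's Appendix E.

```python
# Sketch only: a maximin action rule over candidate belief states, assuming a
# standard DQN-style Q-network that maps a batch of states to per-action values.
import torch

@torch.no_grad()
def pessimistic_action(q_network: torch.nn.Module,
                       belief_states: torch.Tensor) -> int:
    """belief_states has shape (num_beliefs, *state_shape), e.g. 30 PF-RNN
    samples (BP-DQN) or 4 diffusion-purified states (DP-DQN)."""
    q_values = q_network(belief_states)        # (num_beliefs, num_actions)
    worst_case_q = q_values.min(dim=0).values  # pessimistic value per action
    return int(worst_case_q.argmax().item())   # act on the worst case
```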
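
The Open Datasets row points to Atari Pong and Freeway from OpenAI Gym. A minimal loading sketch follows; the environment IDs and preprocessing wrappers (frame skip, grayscale, frame stacking) are assumptions on our part and may differ from the released code.

```python
# Hypothetical environment setup with classic Gym (pre-0.26 API);
# the environment IDs are assumed, not taken from the paper.
import gym

pong = gym.make("PongNoFrameskip-v4")
freeway = gym.make("FreewayNoFrameskip-v4")

obs = pong.reset()
obs, reward, done, info = pong.step(pong.action_space.sample())
```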
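
For reference, the hyperparameters quoted in the Experiment Setup row can be gathered into a single configuration sketch. The dictionary keys below are our own labels; only the values come from the paper.

```python
# Reported settings collected for convenience; key names are ours.
REPORTED_SETTINGS = {
    "BP-DQN": {"num_belief_states": 30},          # kappa_p = |M_t| = 30
    "DP-DQN": {"num_purified_states": 4},         # kappa_d = |M_t| = 4
    "DP-DQN-O": {
        "reverse_steps": {"eps=1/255 or 3/255": 10, "eps=15/255": 30},
        "training_noise": None,                   # no noise added
    },
    "DP-DQN-F": {
        "reverse_steps": 1,
        "sampler_steps": 64,
        "training_noise_eps": 5 / 255,
    },
    "trajectories_for_belief_models": 30,         # PF-RNN and diffusion
    "baseline_training_frames": {"gridworld": 1_000_000, "atari": 6_000_000},
    "our_method_extra_frames": 1_000_000,         # on top of pre-trained DQN
}
```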