Single-Trajectory Distributionally Robust Reinforcement Learning
Authors: Zhipeng Liang, Xiaoteng Ma, Jose Blanchet, Jun Yang, Jiheng Zhang, Zhengyuan Zhou
ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | 4. Experiments We demonstrate the robustness and sample complexity of our DRQ algorithm in the Cliffwalking environment (Delétang et al., 2021) and the American put option environment (deferred to Appendix A). These environments provide a focused perspective on the policy and enable a clear understanding of the key parameters' effects. We develop a deep learning version of DRQ and compare it with practical online and offline (robust) RL algorithms in classical control tasks, Lunar Lander and Cart Pole. |
| Researcher Affiliation | Collaboration | 1Department of Industrial Engineering and Decision Analytics, Hong Kong University of Science and Technology 2Department of Automation, Tsinghua University 3Department of Management Science and Engineering, Stanford University 4Department of Mathematics, Hong Kong University of Science and Technology 5Stern School of Business, New York University 6Arena Technologies. Correspondence to: Zhengyuan Zhou <zhengyuanzhou24@gmail.com>. |
| Pseudocode | Yes | Algorithm 1 Distributionally Robust Q-learning with Cressie-Read family of f-divergences [see the hedged dual-form sketch below the table] |
| Open Source Code | No | The paper does not provide a specific repository link or explicit statement about the release of source code for the described methodology. |
| Open Datasets | Yes | The Cliffwalking task is commonly used in risk-sensitive RL research (Delétang et al., 2021). |
| Dataset Splits | No | The paper describes training and evaluation within dynamic environments (e.g., Cliffwalking, Lunar Lander, Cart Pole) and does not specify traditional train/validation/test dataset splits with percentages or counts. |
| Hardware Specification | No | The paper does not provide specific hardware details such as exact GPU/CPU models, processor types, or memory amounts used for running its experiments. |
| Software Dependencies | No | The paper mentions software and environments like 'OpenAI Gym' and the 'DQN algorithm' but does not provide specific version numbers for these or any other software dependencies. |
| Experiment Setup | Yes | We set the stepsize parameters according to Assumption B.1: ζ₁(t) = 1/(1 + (1−γ)t^0.6), ζ₂(t) = 1/(1 + 0.1(1−γ)t^0.8), and ζ₃(t) = 1/(1 + 0.05(1−γ)t), where the discount factor is γ = 0.9. Most of the hyperparameters are set the same for both Lunar Lander and Cart Pole. We choose Cressie-Read family parameter k = 2, which is indeed the χ² ambiguity set, and we set the ambiguity set radius as ρ = 0.3. For RFQI we also use the same ρ for a fair comparison. Our replay buffer size is set to 1e6 and the batch size for training is set to 4096. Our fast Q and η networks are updated every 10 steps (F_tr = 10) and the target networks are updated every 500 steps (F_up = 500). The learning rate for the Q network is 2.5 × 10⁻⁴ and for the η network is 2.5 × 10⁻³. The Q network and the η network both employ a dual-layer structure, with each layer consisting of 120 dimensions. For the exploration scheme, we choose epsilon-greedy exploration with a linearly decaying epsilon ending at ϵ_end. [see the stepsize sketch below the table] |
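The Pseudocode row names Algorithm 1, a distributionally robust Q-learning method built on the Cressie-Read family of f-divergences. The paper's own update rules are not reproduced in the excerpts above; the snippet below is a minimal sketch, assuming the standard Cressie-Read dual form with k = 2 (the χ² ambiguity set quoted in the Experiment Setup row), of how a robust backup target could be computed from sampled next-state values. The function name, the grid search over the dual variable η, and the example inputs are illustrative assumptions; Algorithm 1 instead tracks η with stochastic-approximation updates along a single trajectory.

```python
# Hypothetical sketch, NOT the paper's Algorithm 1: worst-case expectation
# under a chi^2 (Cressie-Read k = 2) ambiguity set via the dual form
#   inf_{Q: D_f(Q||P) <= rho} E_Q[X] = sup_eta { eta - c_k(rho) * E_P[(eta - X)_+^2]^(1/2) },
# with c_k(rho) = (1 + k(k-1)*rho)^(1/k), i.e. c = sqrt(1 + 2*rho) for k = 2.
import numpy as np

def robust_expectation_chi2(samples: np.ndarray, rho: float) -> float:
    """Worst-case mean of `samples` over a chi^2 ball of radius rho (k = 2)."""
    c = (1.0 + 2.0 * rho) ** 0.5
    lo, hi = float(samples.min()), float(samples.max())
    grid = np.linspace(lo, hi + (hi - lo) + 1.0, 2000)  # coarse search over eta
    best = -np.inf
    for eta in grid:
        residual = np.maximum(eta - samples, 0.0)
        best = max(best, eta - c * np.sqrt(np.mean(residual ** 2)))
    return best

# Illustrative robust backup target from a handful of sampled next-state values.
next_values = np.array([1.0, 0.5, 0.8, 0.2])
print(robust_expectation_chi2(next_values, rho=0.3))
```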
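The Experiment Setup row quotes three stepsize schedules (Assumption B.1), a discount factor γ = 0.9, and linearly decaying epsilon-greedy exploration. The helper below is a small sketch of those schedules as quoted; the epsilon start value, decay horizon, and end value are not given in the excerpt and are placeholders.

```python
GAMMA = 0.9  # discount factor quoted in the Experiment Setup row

def zeta1(t: float) -> float:
    """zeta_1(t) = 1 / (1 + (1 - gamma) * t^0.6)."""
    return 1.0 / (1.0 + (1.0 - GAMMA) * t ** 0.6)

def zeta2(t: float) -> float:
    """zeta_2(t) = 1 / (1 + 0.1 * (1 - gamma) * t^0.8)."""
    return 1.0 / (1.0 + 0.1 * (1.0 - GAMMA) * t ** 0.8)

def zeta3(t: float) -> float:
    """zeta_3(t) = 1 / (1 + 0.05 * (1 - gamma) * t)."""
    return 1.0 / (1.0 + 0.05 * (1.0 - GAMMA) * t)

def epsilon(t: int, eps_start: float = 1.0, eps_end: float = 0.05,
            decay_steps: int = 100_000) -> float:
    """Linearly decaying exploration rate; start, end, and horizon are placeholders."""
    frac = min(t / decay_steps, 1.0)
    return eps_start + frac * (eps_end - eps_start)
```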