Single-Trajectory Distributionally Robust Reinforcement Learning

Authors: Zhipeng Liang, Xiaoteng Ma, Jose Blanchet, Jun Yang, Jiheng Zhang, Zhengyuan Zhou

ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | 4. Experiments: We demonstrate the robustness and sample complexity of our DRQ algorithm in the Cliffwalking environment (Delétang et al., 2021) and the American put option environment (deferred to Appendix A). These environments provide a focused perspective on the policy and enable a clear understanding of the key parameters' effects. We develop a deep learning version of DRQ and compare it with practical online and offline (robust) RL algorithms in classical control tasks, Lunar Lander and Cart Pole.
Researcher Affiliation | Collaboration | (1) Department of Industrial Engineering and Decision Analytics, Hong Kong University of Science and Technology; (2) Department of Automation, Tsinghua University; (3) Department of Management Science and Engineering, Stanford University; (4) Department of Mathematics, Hong Kong University of Science and Technology; (5) Stern School of Business, New York University; (6) Arena Technologies. Correspondence to: Zhengyuan Zhou <zhengyuanzhou24@gmail.com>.
Pseudocode | Yes | Algorithm 1: Distributionally Robust Q-learning with Cressie-Read family of f-divergences [a hedged code sketch of this update follows the table]
Open Source Code | No | The paper does not provide a specific repository link or an explicit statement about the release of source code for the described methodology.
Open Datasets | Yes | The Cliffwalking task is commonly used in risk-sensitive RL research (Delétang et al., 2021).
Dataset Splits | No | The paper describes training and evaluation within dynamic environments (e.g., Cliffwalking, Lunar Lander, Cart Pole) and does not specify traditional train/validation/test dataset splits with percentages or counts.
Hardware Specification | No | The paper does not provide specific hardware details such as exact GPU/CPU models, processor types, or memory amounts used for running its experiments.
Software Dependencies | No | The paper mentions software and environments like 'Open AI Gym' and 'DQN algorithm' but does not provide specific version numbers for these or any other software dependencies.
Experiment Setup | Yes | We set the stepsize parameters according to Assumption B.1: ζ1(t) = 1/(1 + (1 - γ)t^0.6), ζ2(t) = 1/(1 + 0.1(1 - γ)t^0.8), and ζ3(t) = 1/(1 + 0.05(1 - γ)t), where the discount factor is γ = 0.9. Most of the hyperparameters are set the same for both Lunar Lander and Cart Pole. We choose Cressie-Read family parameter k = 2, which is indeed the χ² ambiguity set, and we set the ambiguity set radius as ρ = 0.3. For RFQI we also use the same ρ for a fair comparison. Our replay buffer size is set to 1e6 and the batch size for training is set to 4096. Our fast Q and η networks are updated every 10 steps (F_tr = 10) and the target networks are updated every 500 steps (F_up = 500). The learning rate for the Q network is 2.5 × 10^-4 and for the η network it is 2.5 × 10^-3. The Q network and the η network both employ a dual-layer structure, with each layer consisting of 120 dimensions. For the exploration scheme, we choose epsilon-greedy exploration with a linearly decaying epsilon ending at ϵEnd. [These schedules and hyperparameters are transcribed into a Python sketch after the table.]
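
Expanding on the Pseudocode row above: the sketch below is a minimal, illustrative tabular version of a distributionally robust Q-update built on the dual reformulation of the worst-case expectation over a Cressie-Read ball, inf_{Q: D_{f_k}(Q||P) ≤ ρ} E_Q[X] = sup_η { η - c_k(ρ) · E_P[(η - X)_+^{k*}]^{1/k*} }, with k* = k/(k - 1) and c_k(ρ) = (1 + k(k - 1)ρ)^{1/k}. The function names, the grid search over η, and the single-sample fallback are my own simplifications and not the paper's Algorithm 1, which runs on a single trajectory and tracks η with its own stochastic-approximation stepsize (the η network in the deep version).

```python
import numpy as np

def cressie_read_robust_expectation(values, k=2.0, rho=0.3):
    """Worst-case expectation inf_{Q: D_{f_k}(Q||P) <= rho} E_Q[X],
    estimated from samples of X drawn under the nominal model P via the
    dual  sup_eta { eta - c_k(rho) * E[(eta - X)_+^{k*}]^{1/k*} }.
    k = 2 corresponds to the chi^2 ambiguity set used in the experiments."""
    values = np.asarray(values, dtype=float)
    k_star = k / (k - 1.0)
    c_k = (1.0 + k * (k - 1.0) * rho) ** (1.0 / k)

    def dual_objective(eta):
        shortfall = np.maximum(eta - values, 0.0)
        return eta - c_k * np.mean(shortfall ** k_star) ** (1.0 / k_star)

    # Coarse grid maximization over eta (illustrative only); Algorithm 1
    # instead maintains a running eta estimate with its own stepsize.
    span = values.max() - values.min() + 1.0
    grid = np.linspace(values.min(), values.max() + span, 256)
    return max(dual_objective(eta) for eta in grid)


def robust_q_update(Q, s, a, r, s_next, next_value_samples=None,
                    alpha=0.1, gamma=0.9, k=2.0, rho=0.3):
    """One tabular robust Q-learning step: the bootstrap target uses the
    worst-case next-state value over the ambiguity set instead of a plain
    sample average of max_a' Q[s', a']."""
    if next_value_samples is None:
        # Degenerate single-sample case: reduces to ordinary Q-learning.
        next_value_samples = [Q[s_next].max()]
    robust_next = cressie_read_robust_expectation(next_value_samples, k=k, rho=rho)
    Q[s, a] += alpha * (r + gamma * robust_next - Q[s, a])
    return Q
```

For the values quoted in the setup row (k = 2, ρ = 0.3) this gives c_k(ρ) = (1 + 2 · 0.3)^{1/2} ≈ 1.26, so the robust target is dragged down by the lower tail of the next-state values rather than using their plain average.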
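
To make the Experiment Setup row easier to reuse, here is a direct transcription of the quoted stepsize schedules and hyperparameters into Python. Only the numbers come from the paper; the function names and dictionary keys are mine, which timescale each ζ drives is specified by the paper's Assumption B.1 rather than assumed here, and the ending epsilon is left unspecified as in the quote.

```python
GAMMA = 0.9  # discount factor quoted in the setup

# Stepsize schedules zeta_1, zeta_2, zeta_3 from Assumption B.1
# (function names are illustrative).
def zeta1(t):
    return 1.0 / (1.0 + (1.0 - GAMMA) * t ** 0.6)

def zeta2(t):
    return 1.0 / (1.0 + 0.1 * (1.0 - GAMMA) * t ** 0.8)

def zeta3(t):
    return 1.0 / (1.0 + 0.05 * (1.0 - GAMMA) * t)

# Hyperparameters reported for the deep DRQ runs on Lunar Lander / Cart Pole;
# the dictionary keys are my own naming.
DRQ_CONFIG = {
    "cressie_read_k": 2,             # chi^2 ambiguity set
    "rho": 0.3,                      # ambiguity-set radius (same rho for RFQI)
    "replay_buffer_size": 1_000_000,
    "batch_size": 4096,
    "fast_update_every": 10,         # F_tr: fast Q and eta networks
    "target_update_every": 500,      # F_up: target networks
    "lr_q": 2.5e-4,
    "lr_eta": 2.5e-3,
    "hidden_sizes": (120, 120),      # dual-layer Q and eta networks
    "exploration": "epsilon-greedy, linearly decayed to eps_end (not given)",
}
```

All three schedules start at 1 when t = 0 and decay at different polynomial rates (t^0.6, t^0.8, t), which, as I read Assumption B.1, is what separates the timescales of the coupled updates.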