On the Role of Discount Factor in Offline Reinforcement Learning

Authors: Hao Hu, Yiqin Yang, Qianchuan Zhao, Chongjie Zhang

ICML 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We empirically verify the above theoretical observation with tabular MDPs and standard D4RL tasks. The results show that the discount factor plays an essential role in the performance of offline RL algorithms, both under small data regimes upon existing offline methods and in large data regimes without other conservative methods.
Researcher Affiliation | Academia | 1 Institute for Interdisciplinary Information Sciences, Tsinghua University, Beijing, China; 2 Department of Automation, Tsinghua University, Beijing, China. Correspondence to: Hao Hu <hu-h19@mails.tsinghua.edu.cn>, Yiqin Yang <yangyiqi19@mails.tsinghua.edu.cn>.
Pseudocode | Yes | Algorithm 1: Pessimistic Value Iteration; Algorithm 2: Model-Based Pessimistic Policy Optimization; Algorithm 3: Generalized Value Iteration (an illustrative sketch of Algorithm 1 follows this table).
Open Source Code | No | The paper mentions using the "author-provided implementation or the recognized code" for existing algorithms, but it gives no link to, or statement about, releasing its own code.
Open Datasets | Yes | We empirically verify the two effects on both tabular MDPs and the standard D4RL benchmark (Fu et al., 2020).
Dataset Splits | No | The paper describes the composition of its training datasets but does not specify exact percentages, sample counts, or predefined training/validation/test splits.
Hardware Specification | No | The paper describes its experiments but gives no details about the hardware used (e.g., GPU models, CPU types, memory).
Software Dependencies | No | The paper mentions existing offline RL algorithms (e.g., TD3+BC, BCQ, COMBO, SAC) and states that author-provided or recognized code was used, but it lists no software with version numbers (e.g., Python, PyTorch, or TensorFlow versions, or specific library versions).
Experiment Setup | Yes | The noised trajectories are fragments of the random datasets in D4RL... The proportion of masked state-action pairs is 0.5, and the noise ratio coefficients are {4%, 6%, 8%, 12%}... Most scenarios in Table 2 adopt γ = 0.95 as the lower discount factor, other than BCQ (γ = 0.9) and COMBO (γ = 0.9) in hopper tasks. We select γ = 0.95 in the pen-expert-v0, hammer-expert-v0 and relocate-expert-v0 tasks. We select a lower γ = 0.9 in the door-expert-v0 task.
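For concreteness, below is a minimal, hypothetical sketch of a tabular pessimistic value iteration of the kind named in the Pseudocode row (Algorithm 1), written to expose the discount factor γ whose settings (0.9, 0.95) appear in the Experiment Setup row. The count-based penalty, the `beta` coefficient, and the function name `pessimistic_value_iteration` are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def pessimistic_value_iteration(P, R, counts, gamma=0.95, beta=1.0,
                                n_iters=200, tol=1e-6):
    """Illustrative tabular pessimistic value iteration (not the paper's code).

    P      : estimated transition tensor from the offline data, shape (S, A, S)
    R      : estimated reward table from the offline data, shape (S, A)
    counts : dataset visitation counts N(s, a), shape (S, A)
    gamma  : discount factor; the paper studies lowering it (e.g., 0.9 or 0.95)
    beta   : scale of the count-based pessimism penalty (hypothetical choice)
    """
    S, A = R.shape
    # Count-based uncertainty penalty: (s, a) pairs rarely seen in the data
    # are penalized most heavily.
    penalty = beta / np.sqrt(np.maximum(counts, 1.0))
    V = np.zeros(S)
    Q = np.zeros((S, A))
    for _ in range(n_iters):
        # Pessimistic Bellman backup with discount factor gamma.
        Q = R - penalty + gamma * (P @ V)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            V = V_new
            break
        V = V_new
    policy = Q.argmax(axis=1)  # greedy policy w.r.t. the pessimistic Q
    return Q, V, policy
```

In such a sketch, lowering γ (e.g., from 0.99 to 0.95 or 0.9, as in the reported settings) shortens the effective horizon of the backup and thus acts as an additional form of regularization, which is the kind of effect on offline RL performance that the paper's experiments examine.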